PREFACE TO THE SECOND EDITION


The most important changes that have been made in the second 
edition of this book are as follows. 

1. The pertinent results and implications of approximately one 
hundred and eighty research studies and publications of recent 
date have been incorporated into the text, and the material has 
been documented in the bibliographies. 

2. Twelve recent tests of major importance have been described 
and evaluated, including such instruments as the Kuder Preference 
Record, the Minnesota Multiphasic Personality Inventory, the 
Bennett Test of Mechanical Comprehension, and the SRA Primary 
Mental Abilities Test for Children. 

3. The chapters on intelligence testing have been radically re- 
organized to approach as closely as possible a chronological pres-
entation of the topics treated. In a book of this type a strict
chronological order does not seem desirable, as it would separate 
closely related material, as, for example, the Terman-McNemar 
Test from the Terman Group Test. 

In addition, quite a number of topics have been treated with a 
somewhat different emphasis and somewhat more fully. These 
include reliability, validity, factor analysis, the implications of 
testing during World War II, and so forth. The indexing has been
revised and amplified, with the aim of making it more serviceable, 
and the same applies to all documentation. 

One major feature of the book has, after careful consideration, 
been retained. This is the treatment of general topics having to do 
with the logic of measurement, such as validity, reliability, and
types of scores, in an early chapter, and the return to many of
these topics towards the close of the book. There is an obvious
argument for grouping all such material at one point. The reason
for not doing this is the belief that the student may well be in a
better position to grasp the broader significance of the issues in-
volved after he has dealt with a wide selection of specific applica-
tions, while at the same time he can hardly be expected to study
specific measuring instruments intelligently without a general
orientation.




Once again the position taken in the entire treatment is that 
psychological tests are essentially practical tools and they must 
be understood and evaluated as such. This point of view clearly 
has a bearing on all the fundamental issues of the theory of
measurement, and the attempt has been made to show what its
implications are in this respect. 


James L. Mursell
PREFACE TO THE FIRST EDITION


My purpose in this book is to present a comprehensive and 
balanced account of the testing movement in psychology, taking 
into consideration its past development, its present status, and 
its future prospects. This has determined the method of treat-
ment throughout, the selection of topics for consideration, and the
relative emphasis upon them.

In my opinion psychological tests must frankly be regarded as
practical instruments, and used, evaluated, and interpreted as
such. Indeed, I believe that its practical orientation has from the
first kept the testing movement in the path of sanity and realism. 
This, of course, does not mean that it is unrelated to such broad 
psychological problems as the nature of mental growth, the rela-
tive effect of hereditary and environmental influences, the nature
of mental organization, or the issue between associationist and
configurationist views. Moreover, the student of psychological test-
ing must be aware of the relevance of these problems if he is to
select and use instruments of measurement wisely and to interpret
their results in an enlightened fashion. As a worker in the field
of psychology the student may very properly have decided views
on all such matters, but the testing movement as such does not
prejudge them and has no final answers. I cannot, for instance,
see that it presupposes either a hereditarian or a “mechanistic”
position. Many criticisms of mental testing apply, not to the sub-
ject itself, but to the views of persons prominently associated with
it. Such criticisms may or may not be well taken, but the issue
needs to become clear if much confusion and unintelligent par-
tisanship is to be avoided.

If this prevailing point of view, together with the general pur-
pose of the book, is kept in mind, the choice of topics and their
relative emphasis and subordination will become clear, and will,
I hope, appear reasonable. There is, for instance, at present a very
widespread and intensive interest in factor analysis, yet I have
not gone into it in great detail, though I would not be thought to
deprecate its importance and ultimate promise. The reasons are
first that its psychometric and psychological bearings do not yet
seem to me to have clarified themselves, and second that the gen-
eral student needs an intelligent comprehension of what is being
undertaken rather than a detailed account of findings that are 
often conflicting, and of intricate controversies so far rather re- 
mote from practical issues. In the same way, an immense amount 
of work is going on in projective testing, which is far too impor-
tant to be overlooked. But this again seems to me to be a field for
special study, so that what the general student needs is a broad
orientation rather than a detailed familiarity with the enormous
mass of interpretative data now available.

In choosing specific tests for analysis and discussion, I have in 
the main tried to select instruments which are first-rate examples 
of their type, although there are a few negative instances of tests 
open to very serious criticism. In presenting numerous synoptic 
outlines, my purpose has been to enable the reader to get a fairly 
adequate concrete idea of the tests from the book itself, although 
in the footnotes, bibliographies, and indexes I have tried to pro- 
vide him with facilities by which he can readily expand his ac- 
quaintance with them, and with the literature pertaining to them. 
I have thought it best to abandon the classification of test types 
into the numerous small subdivisions often found in favor of a few 
larger ones. In dealing with intelligence tests, the broad chrono- 
logical perspective on their development, which is now becoming 
possible, has seemed to me to be sufficiently illuminating to de- 
serve emphasis. I have tried to take proper cognizance of the most 
recent work in the field, notably that having to do with the effect 
of preschool environment, foster-home environment, socioeconomic 
factors, and the effect of advanced age upon mentality and test 
performance. Notice also has been taken of the work done in 
mental testing during World War II. The purpose, however, has 
been to treat these subjects, not so much for their own interest 
as for their bearing upon psychometric theory and practice. 

I have definitely decided against including any treatment of 
elementary statistical practice. The reason is that I do not believe 
this properly belongs in a general book on psychological testing. 
If it is included at all, the treatment is bound to be scanty, inade- 
quate, and seriously misleading. It seems to me that any serious 
student of mental testing should be told frankly that he ought to 
be willing to spend the time and effort—and the expenditure need 
not be exorbitant—to understand the statistical concepts and 
techniques involved, either as a collateral part of his study, or as 
a prerequisite to it. There is only one way to grasp the significance
of measures of central tendency, or dispersion, or relationship, or 
of the probability distribution, and that is to work with and 
manipulate them for oneself. To speak candidly, I believe that
the extremely superficial treatments of these subjects not seldom
found are likely to do more harm than good, because they produce
the illusion of understanding without the reality. I make no apolo- 
gies for presenting the subject of psychological testing on the 
assumption that it is a serious one, deserving a serious approach. 


J.L.M. 
CHAPTER I


THE GENERAL CHARACTERISTICS OF 
MENTAL TESTS 


WHAT IS A PSYCHOLOGICAL TEST?


Alfred Binet, who may be considered the originator of modern 
psychological measurement, in answer to his critics, undertook a 
piece of informal investigation which well reveals the nature of 
all mental tests. He invited to his laboratory three teachers who 
were to judge the intelligence of children unknown to them, each 
in any way he pleased. It turned out that each of these teachers 
used substantially the same method. One of them asked the 
children the purposes of canals and sluices. Another showed the 
children some pictures, and requested interpretations and com- 
ments. Another asked about the details of the then recent death 
of King Edward VII. The names of neighborhood streets, the 
proper road to take in order to reach a designated place, whether 
factory walls should be made thick or thin were other typical 
inquiries presented (v. Binet, pp. 182 ff.; Terman, 1916, pp.
gr ff.).*

This in essence was the testing method developed by Binet
and followed by his successors. The teachers utilized the method
crudely. The questions were special and often had a very local
and limited reference. They were asked in different ways, so that 
their difficulty varied even when they dealt with the same topics. 
There was no set standard for evaluating and interpreting the 
answers which the children made. As Binet put it, “The teachers 
employed very awkwardly a very excellent method.” But it was 
the only method they could find to use. A properly constructed 
and administered mental test refines, standardizes, and elaborates 
what these teachers did in their attempt to reach an appraisal of 
the mentality of the children they were examining. 

A psychological test, then, is a pattern of stimuli selected and 
organized to elicit responses which will reveal certain psycho-
logical characteristics in the person who makes them. The psycho-
logical characteristics in question may be general intelligence,

* These and similar notations refer to the bibliography at the close of the book. 
numerical ability, musical talent, mechanical aptitude, aptitude
for some specific vocation, a certain phase or type of interest, a
certain type of emotional or personal set such as introversion or
submissiveness, and so on. The stimuli may be pairs of words to
be marked as having the same or different meanings, or a colored 
design to be copied by using varicolored blocks, or sentences each 
with some word or words omitted to be filled in so as to make 
sense, or lists of occupations and activities to be marked as liked 
or disliked, and so on. These are all typical examples from actual 
use. 

There are one or two points of terminology which it is well to 
be clear about from the outset. The separate stimulus items—the
word pairs, the blocks and the design, the various incomplete 
sentences, the various occupations or activities listed—are usually 
called test items. They are the ultimate constituents out of which 
the test is built. Again, a great many published tests are divided 
into subtests, which usually consist of the same kind of items. 
Thus Army Group Intelligence Examination Alpha, developed 
for the United States Army during World War I, comprises 
ten subtests. It contains a set of brief arithmetical problems, a 
set of questions involving problems of practical judgment or 
common sense, a set of word pairs to be marked as having the 
same or different meanings, a set of disarranged sentences to 
be interpreted, a set of incomplete numerical series to be com- 
pleted, a set of problems which requires the subject to find 
analogies to certain words, a set of information items in mul- 
tiple choice form, and also three other subtests. The intention 
always is to set up an orderly and organized pattern of stimuli 
which will reveal the mental characteristics of the person who 
makes the responses, and also to show how the responses them- 
selves must be interpreted if the mental characteristics they are 
supposed to indicate are to be correctly appraised. Tests so con- 
ceived are clearly a refinement and standardization of what was 
done by Binet’s three teachers. Such is the essential nature of a 
psychological test. 


There are two distinctions which further clarify the matter. 


1. How does a psychological test differ from a psychological 
experiment? 


In externals, at any rate, experiments and tests are often 
very similar. Both of them involve the presentation of stimulus 
situations and the appraisal of responses. Moreover, the types of
stimuli used are often pretty much the same. Thus paper and 
pencil mazes have been used both as test items and for experi- 
mental purposes in the psychological laboratory. The difference 
lies in the purpose for which the material is set up. In the maze 
test, a series of mazes of increasing difficulty is presented for solu- 
tion by children ranging in age from three to fourteen years. The 
intention is to reveal certain aspects of intelligence, and notably 
“prudent and considered behavior.” Such is the purpose of the 
test (Porteus, 1915, 1924). In the laboratory, however, mazes have 
been used to study the learning process, and to determine what 
is involved in becoming able to run them correctly and confidently. 
In the same way, series of digits for immediate recall have been 
used both in testing and in experimental work. But in the first 
case the purpose is to reveal the ability of the subject; whereas 
in the second, it is to investigate the process of memory. So, in 
broad terms, the difference is that psychological tests aim to reveal 
the characteristics of persons, and psychological experiments aim 
to reveal the characteristics of mental processes. 

The distinction, of course, is far from absolute. Any adequate 
appraisal of the mentality and personality of a human being 
would obviously call for an understanding of the nature of his 
mental processes. If it is proposed to rate him on general intelli- 
gence, introversion, or interest, the question of what these traits 
really are is certainly involved; and until it is answered, thor- 
oughly satisfactory tests cannot be constructed. 

Moreover, at the present time these two lines of work, the one 
in mental measurement and the other in experimentation, which 
have heretofore been pretty separate, are coming together. In par- 
ticular, workers in the field of psychological testing are becoming 
more and more concerned with the nature of the processes with 
which they try to deal. With the accumulation of test data has
come the belief that they ought to throw light not only on the 
characteristics of persons, but on the nature of mental processes 
and the organization of the mind as well. An elaborate and still 
controversial body of techniques known as factor analysis has 
been developed with this latter consideration in mind. One impor- 
tant factorial study has led to the conclusion that performance on 
a certain set of tests involves seven mental processes—numerical 
facility, word fluency, visualization of space, memory for words, 
names and numbers, perceptual speed, verbal reasoning, and induc-
tion (Thurstone, 1938). Another outstanding investigator, using
these techniques, maintains that most mental test performances 
call chiefly for a general factor which seems to consist of a general 
intellectual energy (Spearman, 1927). We shall return to this 
subject more fully later on. But for the moment the point is that 
though a psychological experiment and a psychological test have 
different orientations, one having to do with persons and the other 
with processes, the two lines of work are already converging and 
are likely to come together more and more as time goes on. 


2. How do psychological tests differ from educational tests? 


In general, there is a broad and obvious working distinction 
between tests dealing with mental processes and characteristics 
and those dealing with achievement in school subjects, such as 
reading, arithmetic, spelling, social studies, science, and the like. 
The form is the same in both kinds of tests. Items or stimulus
situations are set up to evoke revealing responses. But the pur-
pose is different.


Here again, however, the distinction is far from absolute. Edu-
cational tests obviously involve mental processes. Achievement in
a school subject may call, at any rate, for memory, and often for 
understanding, insight, the ability to solve problems or to collate 


on the material
Thus it has been
such as the well-


1945). 

The essential distinction, then, is not between two quite dif-
ferent kinds of processes. Rather, it is one between mental tests
that are general in scope and educational tests that are specific in
their reference. Also, it turns upon purpose, the mental test empha-
sizing psychological processes and being set up to reveal them in
terms of the score achieved, and the educational test emphasizing 
attainment in some subject or group of subjects. The point has a 
considerable practical importance. Attempts have often been made
to determine the extent to which pupils in school are exerting
themselves, or working up to or below their capacities, by com-
paring their showings on mental tests on the one hand and on
educational tests on the other. If a child does well on a mental 
test and less well on educational tests, he is supposed to manifest 
a lack of motivation or seriousness, for his achievement is not 
what one might properly expect. This relationship between men- 
tality and attainment has even been reduced to a numerical index, 
called the Accomplishment Quotient, or A.Q., which is the ratio of 
educational age to mental age (Franzen). If a child’s educational
attainment is not on a level with his mentality, this gives him an
A.Q. of less than 100 and suggests that something is wrong. And
in some schools a marking system has been set up based, not on 
direct achievement in the various subjects, but on educational or 
achievement ratings divided by ratings on mentality. There is a
good deal to say about this whole idea, but a full discussion will
have to be postponed. So far, however, this much is clear: it must
not be supposed that an intelligence rating and an educational
rating are two independent variables. Mental tests and educa-
tional tests have much in common and differ chiefly in generality
and purpose. This in itself suggests very decided doubts about
such methods of treating their results in general, and about the
Accomplishment Quotient technique in particular.
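
The arithmetic of the Accomplishment Quotient can be shown in a short sketch. It assumes that the ratio of educational age to mental age is multiplied by 100, which the text implies when it speaks of an A.Q. of less than 100. The function name and the sample ages are hypothetical illustrations, not figures taken from Franzen, and the sketch shows only how the index is computed, not that it is a sound one.

```python
def accomplishment_quotient(educational_age: float, mental_age: float) -> float:
    """Return 100 * educational age / mental age.

    The factor of 100 is an assumption inferred from the text's remark
    that lagging attainment gives an A.Q. of less than 100.
    """
    if mental_age <= 0:
        raise ValueError("mental_age must be positive")
    return 100.0 * educational_age / mental_age


# Hypothetical pupil: educational age of 9 years, mental age of 10 years.
aq = accomplishment_quotient(9.0, 10.0)
print(aq)  # 90.0, i.e. attainment below the level suggested by mentality
```

Because both ages rest on test performances that have much in common, the quotient is a ratio of two related quantities and not of two independent ones, which is the ground of the doubts expressed above.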
Moreover, the distinction between mental and educational tests,
never absolute, is tending to become more and more blurred at
the present day. Educational tests are being devised that are
increasingly broad and general in their reference. Such tests are
directed not only to the memory processes and information, but
to problematic thinking, the drawing of inferences from data, the
application of generalizations to specific problems and situations,
study habits and practices, appreciative insights, and the like.
Their chief difference from psychological tests proper is that they 
utilize material from one area of subject matter only—apprecia-
tive insights in the fine arts or literature, the drawing of inferences
from the data of chemical experiments, problematic thinking in
mathematics, and so on. But they are apt to stress psychological
processes very heavily. A good recent example of an educational
test which is very general in its character is the battery known as
the United States Armed Forces Institute Tests of General Edu-
cational Development. These instruments deal with
science, the social studies, and the humanities. Rather than
stressing specific content, they call for the interpretation of pas-
sages of reading material in these fields. As a matter of fact, this
generality—this tendency to stress psychological processes rather
than highly specific knowledge and skill—has been made a ground
of criticism against some modern educational tests. It is said that
they gloss over specific mastery in the area of subject matter con-
cerned, so that an able person can do well even if he knows little
about the subject. This is an objection often urged against com-
pletion tests, which may easily reveal brightness or general intel-
ligence rather than actual knowledge.

E i poe nk educational aptitude tests, ees 
are borderline cases between the two types here under consi aa 
tion. Thus in the Iowa University Placement mae mination apti- 
tude tests are set up for entering freshmen in a imber o i E S 
The mathematics aptitude test, for instance, consists of number 
series completion, problems calling for spatial imagin 
bolic logic, and the interpretation of new and 
matical reading. It is intended to reve 
mastery of mathematics, but his c 
In effect, one might call it an inti 
mathematical slant. But here ag 
pointed out above that educational achievement tests heavily em-
phasizing the so-called higher mental processes may reveal men-
tality rather than mastery of subject matter. On the other hand,
an alleged mathemati i uch as the foregoing may 
matical training, A person 
and promise who had done 
might easily outshine a bril- 
ittle or none. In this case the intention 
of the test—the revelation of aptitude or promise—would be to 
Some extent defeated (Stoddard, 1928). 

he contrast between psychological and 
n connection with that between experi- 


al ability 
a good deal of studying in the field 


, but it is a 
e distinction, in any 
onal tests is pragmatic 


ncy in present-day work. Th 


rather than absolute. They differ in purpose and emphasis. They
differ in generality. If what we want is a rating on mentality, then 
clearly a psychological test is the one to choose, because it is set 
up with this intention and because a person’s performance on it 
can be immediately interpreted in terms of his mental charac- 
teristics. But it is not true that such a test turns upon factors 
wholly or even very largely distinct from those that affect per-
formance on an educational test.


TYPES AND CLASSES OF PSYCHOLOGICAL TESTS


Existing psychological tests can be classified in a number of
different ways, some of which are quite superficial, whereas others
reveal deep and far-reaching differences.


1. Psychometric and projective tests 


The most fundamental classification of existing tests is into 
psychometric and projective instruments. There is between them 
a far-reaching distinction both in purpose and in methodology. 

The purpose of a psychometric test, as the term itself indicates,
is to reveal or measure the amount of some mental trait or charac- 
teristic possessed by the subject. The purpose of a projective test, 
on the other hand, is to reveal the quality or type of the subject’s
personality. Thus it is not, properly speaking, an instrument of
measurement at all. “A projective method for the study of per-
sonality involves the presentation of a stimulus situation designed
or chosen so that it will mean to the subject not what the experi-
menter has arbitrarily decided that it shall mean . . . but rather
whatever it must mean to the person who gives it, or imposes upon
it his private idiosyncratic meaning and organization” (Sargeant,
p. 257).

With regard to methodology, the psychometric test sets up 
stimulus situations to which definite predetermined values have 
been assigned. Thus the series of numbers 2, 4, 6, 8 is presented, 
and the task is to indicate what the next number ought to be. If 
the response “10” is forthcoming, the subject receives a certain 
designated score. Or a vocabulary list of words in order of increas- 
ing difficulty is given orally, the task being to assign proper mean- 
ings to the various words. If a subject can define twenty of them, 
he comes up to the mental age level of eight. If he can define 
thirty, he comes up to the mental age level of ten. Or lists of 
personal feelings and reactions
bees = Mi ne kta great deal, or dislikes eae 
Age Sea a See Affirmative or negative replies a 
a ca toe vie orting to show the extent of introversion, A 
aA A a ea or the like. This illustrates what is mea 
dominance- 


and, may present as snai 
of pictures, the task being 


y any means 
r in which a 
always con- 


y 0) y i ests. The Manne} 
bsent fr m ps; chometric t ts h n 
a eoad; to an individual in elligence test is 
C. S 


Misunderstanding of thi 
tations: 


2. Classification by types of process

A less fundamental but still important and serviceable basis for 
the classification of mental tests is in terms of the kind of mental
process they purport to reveal. H 
sally accepted and agreed 
general intelligence, There are t 
interest. There are tests of personality, dealing with such charac-
teristics as introversion, sociability, and the like. There are tests
of attitude towards specific school subjects or specific races, or 
towards any school subject or any race, and so on. There are tests 
which deal with determining values, such as aesthetic or practical 
reactions to life and its problems. There are tests of moral traits, 
such as honesty and fair-mindedness. 

Just what classification in this respect is adopted is not of the 
first importance. The designated mental processes are not well 
defined. A talent is not clearly differentiated from an aptitude. 
Attitudes, values, and moral traits merge into one another. The 
meaning of the various terms is far from clear. All that is needed 
is a grouping of existing tests that offers some convenient frame 
of reference so that it is possible to know fairly well to what a
person is referring.


3. Classification by types of items 

First there are tests that are verbal in content. The stimulus 
consists of words, including of course mathematical symbols. The
task is to manipulate words or symbols. The required response is
verbal. Then there are tests that are prevailingly nonverbal, which
present pictures to be interpreted or matched, blocks with which
to build indicated designs, form boards with holes of various
shapes into which the corresponding cutout pieces are to be fitted,
boards with numerous holes into which pegs are to be placed, and
so forth. Such tests are not entirely nonverbal, because the in-
structions at least are given orally. All of them are sometimes
lumped together as performance tests, i.e., tests which call for
manipulation rather than verbal response. This, however, is not
correct usage and leads to serious confusion. Test items of the
kind described can be used to reveal some general mental trait,
usually intelligence. In this case we have a true performance test.
Or such items can be used to reveal manual or mechanical ability,
and in this case the test should be classified as an aptitude test.

Then once again there are true nonlanguage tests, in which the
task is to manipulate or compare or arrange objects or follow
directions, and the instructions are given in some sort of panto-
mime to avoid the use of speech.


4. Classification by mode of administration 
In this connection two types of tests are found. First there are 
individual tests, so called because they are administered to only
one subject at a time. Such tests have the decided advantage of 
making possible oral as well as written or manipulative responses. 
Outstanding instances are the various revisions of the Binet scale, 
and the Kuhlmann Tests of Mental Development. Then there are 
group tests, so called because they can be given to groups of sub- 
jects simultaneously. One sometimes finds the term scale used as 
though it meant the same thing as individual test. This, however, 
is quite erroneous. Not all individual tests would be thought of
as scales. And not all scales are instruments for purely individual 
administration. 

This discussion of classification will serve to explain certain 
terms commonly used in the field of mental testing, and to present 
some idea of the scope of modern mental testing. It is evident 
that all groupings except the first are chiefly pragmatic and do not 
turn on any very clear-cut or fundamental differences.


LIMITATIONS AND VALUES OF MENTAL TESTS


During the past forty years vast experience in the application
of mental tests has accumulated, and almost innumerable research
investigations regarding them have been conducted. This provides
a broad and firm basis for summarizing with a good deal of cer-
tainty both their limitations and their possibilities.

1. Types of test items

A good way to approach this question is by a survey of repre- 
sentative types of items that have been and are being used in test 
construction. As Binet himself e 


piece of informa] research des 


indicate general mental characteristics 
S that started the modern testing movement 
Anything like a complete catalog of such
items would certainly be a very large and difficult undertaking.
Nor is it necessary. All that is needed is a survey suf-
ficient to make their general nature clear.
S numbers of tests have been published and 
ng family resemblance among the items out 


of which they are built. Since the early work of Binet, whose first 
tests were published in 1905, a vast pool of such items has been
accumulated. Successive test makers have drawn upon this pool 
again and again, often adding minor variations, sometimes pro- 
posing definitely new departures and novelties. Great ingenuity 
has gone into this work. But the general outcome has been in the 
direction of uniformity rather than of wide variety. 

Among the items used in verbal intelligence tests, the following 
are some of the most familiar.

Verbal opposites, such as: black—blue, light, white, dark. (Un- 
derline the word among the last four which is opposite in meaning 
to the first.) 

Analogies, such as: shoe—foot; glove—head, wrist, leg, hand. 
(Indicate the word among the last four which has the same rela- 
tion to the third as the second has to the first.) 

Best reasons: Several alleged reasons are listed for a course of
action, a belief, etc., the task being to indicate the best one. 

Disarranged sentences, such as: turn gas go if the I stove off
out the will. (Interpret correctly.) 

Sentence completion: Sentences W 


to be filled in so as to make sense. 
Proverbs : Familiar proverbs presented for interpretation, which 


may be given orally as in an individual test, or chosen from a 
list of possibilities as in a group test. 

Series Completion: An arithmetical or algebraic series is pre- 
sented, the task being to indicate what the next logical term should 
be. Sometimes the subject must decide on the proper continuation 
directly, and sometimes he must make a choice from several listed 
Possibilities. 


ith a word or words omitted, 


Directions: Instructions to be followed, sometimes in large overt
action, such as doing a number of things in the room in a desig-
nated order, as in an individual test; sometimes with paper and
pencil, such as following a chart or a maze course in a designated
sequence.

Information: Items in “objective” form, usually multiple choice,
to indicate the extent of the subject’s information about widely
known topics or things.

Memory: Memory items have been rather widely used. In indi-
vidual tests they take the form of digits or sentences with a stated
number of syllables that the subject is required to repeat ver-
batim, or paragraphs or episodes of which the subject must give
the gist after hearing them read once. Another type of memory
item is to see how many of the objects in a collection the subject 
can name after looking at them for a given time. Items of this type 
appear with appropriate modifications in a good many group tests. 

Arithmetic problems: The use of such problems, given either 
in verbal or numerical form, is a favorite resource in test con- 
struction. 

Vocabulary: Vocabulary and word knowledge items are con- 
sidered to be of great value. A common procedure is to set up a 
list of graded stimulus words and to see how many of them the 
subject can roughly define.

Classification: A stimulus word followed by several alternative 
suggestions as to the classification of the thing indicated, the task 
being to pick out the right one. 

As instances of the type of primarily nonverbal items widely 
used in intelligence tests, the following three are typical. 

Draw a man: The subject draws a schematic picture of a man 
on instructions to do so, and is rated on his showing the proper 
number of limbs, etc., and on indicating perspective and propor- 
tion. This is a novel type of item and has been used in only a very 
few tests. 

Space relationships: See Figure 1a. The task is to determine
what numbers are only in the circle and what number is in all
three figures.

Cube relationships: See Figure 1. The task is to decide how
many cubes there are in such piles.


Aptitude, personality, attitude, and talent tests naturally use a 
much greater variety of types of items. Here are a few typical 
instances. 


... gears and pulleys shown in pictorial form, ...


Aesthetic comparison: Comparisons of grouped pictures, visual
designs, literary excerpts, musical themes, etc., the task being to
indicate their relative aesthetic values. The test objects are some-
times presented in pairs, sometimes in larger groupings to be
ranked.

Personal questions: A very large number of types of personal
questions are used in tests of personality, attitude, value, and
interest. They have to do with the subject’s opinions about him-
self, his likes and dislikes, his views on various subjects, his inter-
ests, and so forth. They may be thrown into a form that enables
him to answer yes or no, or set up as choices between various
stated alternatives, or he may be asked to give numerical indica-
tions of the strength of his feelings, beliefs, and the like.

Projective tests usually present specific stimuli but give the
subject a great deal of latitude in making any response he desires.

FIG. 1. INSTANCES OF NON-VERBAL INTELLIGENCE TEST ITEMS

This, of course, is very far from a complete survey of all the
types of test items in use. There are literally endless variations and
combinations that have been tried out.* But so far as psycho-
metric tests are concerned, they all have a common characteristic.
All of them are designed and set up in such a manner that the
responses they elicit can be treated numerically; that is, they can
be counted so that a total score is possible. This procedure lies at
the very heart of psychological testing as we know it today, and
alone makes such testing possible.

* See Pintner, 1931, pp. 183-90 for a fuller account of basic types
of items used in intelligence testing.
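
The counting procedure just described may be illustrated by a minimal sketch in Python. The two items are taken from the examples of verbal opposites and analogies quoted earlier in this chapter; the item labels, the answer key, and the function name are hypothetical, introduced only for illustration, and no actual test is scored in exactly this way.

```python
# Hypothetical answer key built from the sample items quoted in the text:
# the opposite of "black" is "white"; shoe is to foot as glove is to "hand".
ANSWER_KEY = {
    "opposite_of_black": "white",
    "shoe_is_to_foot_as_glove_is_to": "hand",
}


def total_score(responses, key=ANSWER_KEY):
    """Count the responses that agree with the key.

    An omitted item or a response not listed in the key earns nothing,
    which is the property that makes the result a simple count.
    """
    return sum(1 for item, correct in key.items() if responses.get(item) == correct)


if __name__ == "__main__":
    responses = {
        "opposite_of_black": "white",
        "shoe_is_to_foot_as_glove_is_to": "wrist",
    }
    print(total_score(responses))  # one of the two items is answered correctly
```

Only agreement with the key is counted, so a novel or unusually thoughtful response contributes no more to the total than an omission.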


2. Limitations of psychological tests 


It must, however, be abundantly clear that tests built out of 
such items are subject to very grave limitations. In order to gain 
a correct understanding of what such tests can and cannot do, it 
is very necessary to appreciate the nature of these limitations. 

A. They cannot directly reveal a person’s capacity for complex 
and sustained learnings. Such capacity is of the highest impor- 
tance, and it is beyond doubt one of the characteristics of all 
superior achievement. To learn a language well and rapidly, to 
master a science or a branch of mathematics, to become able to 
deal with complex practical situations such as confront the busi- 
nessman, or the strategist, or the surgeon, or the engineer call for 
sustained energy, persistence, and very complex processes of men- 
tal organization. Yet our tests cannot deal with such processes, for 
the simple reason that they cannot be translated into items which 
are numerically manageable and susceptible of being counted. 
Perhaps, as it has been claimed, these processes always exist in a 
person’s behavior in some definite amount. But if there is no way 
of determining what that amount may be, the hypothesis is of 
little aid. 

B. Our tests cannot directly reveal capacity for disentangling
concepts from complex masses of data. Here again are mental
processes of the highest importance ... central in much of what is
best ... the slow and painstaking discovery ... principles, the
working out of proper operative techniques, decisions upon the
right line of action in the midst of many and bewildering
alternatives, and “straight thinking” generally. Yet once more they
... quantitative items.

C. ... reveal capacity for consistent and considered choice
between possible ...
limited and artificial situations means inevitably that they cannot
deal with some of the most fundamental aspects of life and 
behavior. 

D. Our tests cannot directly reveal capacity for dealing sensibly 
and wisely with practical problems. A boy may be asked what he 
would do if he found an automobile on the street abandoned and 
unlocked with the keys in it. He may be scored on his response 
to three or four possible alternatives, but his answer may have 
very little relationship to his actual behavior on such an occasion, 
because the essential elements of temptation and opportunity are 
lacking. A man may be asked to assemble ten small objects or to
trace a complex of mechanical relationships in a diagram, and
he may be scored quite definitely on the result. But it would be
very rash to ask him to adjust the carburetor of one’s car, or to 
repair a radio circuit simply on the basis of his showing. 

E. Our tests cannot reveal directly a person’s capacity for con-
trolled and effective methods of work. Good methods of work are
one of the most decisive of distinguishing marks in a person of
high achievement and ability. They are usually learned slowly
and painfully over a period of many years and as a result of much
experience and many contacts and suggestions. Moreover, they
are highly individualized and must be suited to the person con-
cerned, so that beyond a certain point they cannot be standardized.
There is no way whatever to reduce such processes to a series of
test items capable of adding up to a total numerical score.

F. Our tests cannot directly reveal the depth, strength, and
subtlety of a person’s appreciative reactions in ethical, social, or
aesthetic matters. Such measures as we have looking in these direc-
tions usually consist either of short questions for self-evaluation
or of objects such as pictures or poems to be compared in terms
of aesthetic value. It is, however, doubtful whether such value-
discriminations can be put fully in verbal form; and if they can,
the expression of them would be very complex and full of quali-
fications. As to direct comparisons between the aesthetic status of
pairs of objects, it is quite clear that the sensitive person will be
aware of and respond to all sorts of nuances which cannot show
up in such choices.

G. Our tests cannot even begin directly to reveal
capacity for producing original ideas and constructions, for initia-
tive, for the original solution of problems, for creative ...
Indeed, the type of items used systematically discourages original-
ity and places every emphasis upon the production of the
“correct” response. Thus it has been pointed out that if a child
produced a brilliant but novel definition of a word given in a
vocabulary test, thus indicating a very high level of mentality, he
would nevertheless be scored zero on the item because his reply
did not fit in with the standardized scheme of the instrument.

In general, tests built out of items such as have grown into
general use cannot directly reveal the “higher mental processes”— 
the very processes most typical of human behavior at its best, on 
which the loftiest distinction and supreme achievement depend. 
By way of answer it may, to be sure, be said that our present psy- 
chometric instruments do in fact indirectly reveal the presence 
and to some degree the excellence of these processes and capacities. 
On this premise, if men like Macaulay, or John Stuart Mill, or 
Darwin, or Leonardo da Vinci could be persuaded to submit to a
well-chosen battery of tests, they would rank extraordinarily high, 
even though they certainly would not find an opportunity to show 
their power in all its fullness. There is much real force in this 
argument, and it is probably quite sound as far as it goes. Yet it 
does not wholly meet the case; for the fact seems clear that 
because of the nature of their construction and content, without 
which they would be impossible, present-day tests cannot directly 
reveal the operating essence of human mentality at its highest 
levels, or even as it actually functions to produce reasonable suc- 
cess in school, in one’s job, or in one’s everyday social and prac- 
tical activities. Therefore, while it is impossible to agree with 
radical critics who dismiss all psychological testing as a perversion 
and a concentration on the artificial and trivial, it cannot be denied 
that their claims have enough validity to warrant serious attention. 


3. Values of psychological tests 


The other side of the picture is that despite limitations which
every judicious student of the subject is b... modern testing
movement has achieved great and indubitable
successes, both practical and theoretical.

A. It is in connection with the practical uses of psychological
tests that the most obvious and unanswerable case can be made.
Let us, for instance, consider the data presented in Tables 1
and 2 which, be it noted, were reported a good many years ago.
Table 1 shows a close relationship between performance on the
Stanford Revision of the Binet scale * and educational pros-
pects. All entering high school freshmen above a certain level on
the intelligence test complete their course, and nearly all of them
are in higher institutions five years after testing. None of those
below a certain level are in either category. Moreover, in the
following classification the relationship becomes still more evident.
The top quarter of this same group of 107 students had intelligence
quotients running from 119 to 142. Of them 100% finished high
school, and 91% continued education beyond it. The lowest quar-
ter of the group had intelligence quotients running from 79 to 97.
Of them 37% finished high school, and 12% continued their
education beyond it (Proctor, 1925).


TABLE 1

STANFORD-BINET I.Q.’s OF ENTERING HIGH SCHOOL PUPILS IN RELATION
TO EDUCATIONAL PROSPECTS

(After W. M. Proctor, 1925, p. 30 ff.)

                      STANFORD-BINET I.Q.’s OF 107 ENTERING HIGH
                      SCHOOL PUPILS

                      125 plus | 115-124 | 105-114 | 95-104 | 85-94 | 75-84

Percents complet-
ing high school         100    |   96    |   83    |   75   |  40   |   0
Percents in higher
institutions five
years after testing      95    |   86    |   54    |   28   |  18   |   0


* The Stanford Revision of the Binet scale was the first revision of
the Binet tests, made at Stanford University under the auspices of
L. M. Terman in 1916. The Revised Stanford-Binet scale was the
second revision, made also at Stanford under the same auspices in
1937.

Or again, consider Table 2. Note that of the three individuals
in the bracket of 75-84 I.Q., no one gets an average mark above
C+; whereas of the four in the bracket of 135 and over, no one
gets an average mark of less than C+. Moreover, if the facts in
regard to an individual’s intelligence level are known, much can 
be done to help and guide him. Twenty-two students entering 
high school were given the Stanford Revision of the Binet scale in 
the second half of their eighth-grade year and, on the showing 
they made, were given special guidance. Their later work was 
compared to that of 109 unguided high school students comparable 
in intelligence. Of the guided group 18% made one failure in a
high school course, whereas 31% of the unguided group made one 
failure. None of the guided group made two or more failures, 
whereas 11% of the unguided group did so (Proctor, 1918). These 
investigations were made more than two decades ago, and they 
have been selected for this very reason. Their findings are typical 
and have been confirmed in substance and amplified time and 
again since then, in many and varied connections. Clearly, then, 


TABLE 2


STANFORD-BINET I.Q.’s OF 131 HIGH SCHOOL STUDENTS IN RELATION TO
AVERAGES OF ALL HIGH SCHOOL MARKS


(Quoted from W. M. Proctor, 1925, Table 5, p. 41)


AVERAGE OF
ALL HIGH                       STANFORD-BINET I.Q.’s
SCHOOL
MARKS   | 75-84 | 85-94 | 95-104 | 105-114 | 115-124 | 125-134 | 135 up | Totals
A o o s. 
me z o 3 4 4 I 12 
o 2 5 8 o 
B o 7 I 4 š ; 
Ch . : 9 8 12 7 I 54 
C 5 4 2 o I 18 
I 4 6 3 I 15 
D : ; o o 
= 3 I o o o 10 
o Ž I o 6 g a 2 
Totals  |   3   |  22   |   36   |   24    |   27    |   15    |   4    |  131


a mental test score reveals something very well worth knowing,
even though it is not always decisive and is hedged about by
many admitted limitations.


As another example may be cited the large body of work that 
has been done on the use of psychological tests in connection with
college entrance. The first necessity is the selection of a suitable
test, sufficiently difficult to reveal differences in mental ability
among the highly selected individuals who present themselves for
rating. An instance of such a test is the Thorndike Psychological
Examination for High School Graduates. Experience and investi-
gation have decisively shown that such an instrument, which takes
about three hours to administer, will discriminate well between
those likely to succeed in their college course and those apt to have
difficulty or to fail (Thorndike, 1920). It will, in fact, predict
success much better and more reliably than the pattern of subjects 
taken for college preparation, and on the whole just about as well 
as the student’s average mark in all his subjects during his high 
school career. And the test in this case, as pointed out, requires 
but three hours, whereas a high school course takes four years! 
Here is another finding, reported years ago, and consistently con- 
firmed. 

Moreover, when one such test is given year by year for a period 
of years to all students applying for admission, a college can 
determine a critical score below which success is unlikely (Wood). 
Such a score must be determined by the individual college, because
standards vary so greatly among different institutions as to make
the establishment of a general level applicable everywhere impos-
sible. The college, however, may admit candidates who fall below
the critical score if other factors are unusually favorable—if they 
are serious, hard-working, and have a superior character record. 
Thus psychological tests do not tell the whole story by any means. 
But there can be no doubt that they furnish highly important and 
valuable data which have the great advantage of being more or 
less an independent estimate of the persons concerned. 
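
How a college might arrive at such a critical score can be suggested by a rough sketch in Python. The source does not say how any college actually computed its critical score, so everything here is an assumption made for illustration: the records are invented, the function name is hypothetical, and "success is unlikely" is arbitrarily taken to mean that no more than one quarter of the students below the score succeeded.

```python
def critical_score(records, max_success_rate=0.25):
    """Return the highest cutoff below which past students rarely succeeded.

    records is a list of (test_score, succeeded) pairs accumulated over
    several years. A cutoff qualifies when the students who scored below
    it succeeded at a rate no greater than max_success_rate. Returns None
    when no cutoff qualifies.
    """
    best = None
    for cutoff in sorted({score for score, _ in records}):
        below = [succeeded for score, succeeded in records if score < cutoff]
        if below and sum(below) / len(below) <= max_success_rate:
            best = cutoff
    return best


if __name__ == "__main__":
    # Invented records for one hypothetical college.
    records = [
        (40, False), (45, False), (50, False), (55, True),
        (60, False), (65, True), (70, True), (80, True),
    ]
    print(critical_score(records))
```

Because the cutoff depends entirely on the records supplied, two colleges with different standards would obtain different critical scores from the same test, which is the reason given above for leaving the determination to the individual institution.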

To take a more recent example, a specially designed test, the
Medical Aptitude Test, was for many years prepared and used
annually in connection with the admission of candidates in many
leading medical schools under the auspices of the American Asso-
ciation of Medical Colleges. Reports and data were exchanged
between the Association and the various institutions using the
test, so that it was systematically improved, and better and better
interpretations of the showings it revealed were built up. Its gen-
eral value is clear from the data summarized in Table 3. As will
be seen, the scores when divided into decile steps had a very
definite relationship to average grade in medical school and to the
probability of failure. In view of the expense of medical educa-
tion, such prediction is obviously important (Kandel, Moss). The
fact that within the last year or two the individual schools have
taken over responsibility for testing admission candidates, so that
the nationwide program has been discontinued, cannot impair these
findings or the general significance of the instrument itself.
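
The strength of the relationship shown in Table 3 can be checked with a short calculation, sketched here in Python. The figures are those printed in the table; the function name is hypothetical, and it is assumed that decile 1 is the highest-scoring tenth, since it shows the fewest failures and the highest average grade.

```python
from math import sqrt

# Figures transcribed from Table 3 (Kandel, p. 17).
deciles = list(range(1, 11))
pct_failures = [2, 8, 8, 10, 10, 12, 14, 18, 19, 25]
avg_grade = [85.0, 82.4, 81.9, 81.1, 80.8, 80.3, 79.8, 78.5, 77.8, 76.4]


def pearson(xs, ys):
    """Product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Failures rise and average grade falls as the decile number increases.
    print(round(pearson(deciles, pct_failures), 2))
    print(round(pearson(deciles, avg_grade), 2))
```

These coefficients describe ten group averages, not individual students, and a correlation computed on grouped figures of this kind will ordinarily run higher than one computed on the individual scores from which the groups were formed.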
TABLE 3

RELATIONSHIP BETWEEN TEST SCORES ON MEDICAL APTITUDE TEST AND
ACHIEVEMENT IN MEDICAL SCHOOL

(Kandel, p. 17)

Decile Test Score | Percent of Failures | Average Grade

        1         |          2          |     85.0
        2         |          8          |     82.4
        3         |          8          |     81.9
        4         |         10          |     81.1
        5         |         10          |     80.8
        6         |         12          |     80.3
        7         |         14          |     79.8
        8         |         18          |     78.5
        9         |         19          |     77.8
       10         |         25          |     76.4

Finally, reference must be made to the very extensive use of
psychometric tests of many kinds during World War II. The
varied testing programs in the various branches of the armed
forces well exemplify both the values and problems of psychologi-
cal measurement, alike in their development, their procedures,
their numerous successes, and their occasional failures (v. Davis,
1943; Guilford, 1943; Stalnaker, 1945). To cite the selection of
aviation cadets as a specific instance, during the period from 1924
to 1941 anywhere from 45% to 75% of trainees were rejected at
some stage in their training, usually at an early one. A sequential
testing program, designed to reveal first general fitness, and then
special qualification for specific duties, was an important factor in
reducing this high proportion of waste very markedly (Flanagan,
1942).


The guidance counselor, the clinical psychologist, and the psy-
chiatrist, too, find numerous and important uses for psychological
tests. Such workers employ tests not so much for exact and final
measurement, but for the refinement of observation. No person
experienced in work of this kind will make himself the slave of
test data. But he will utilize tests for the observation and assess-
ment of behavior under controlled conditions (Cornell and Coxe,
Rapaport, Gill, and Schaefer).

So far, then, there is a clear case. Experience and investigation 
have consistently shown that good tests, properly used and conser-
vatively interpreted, are exceedingly valuable instruments. What-
ever their limitations, they do in fact provide important informa-
tion about the capabilities and prospects (vocational, academic,
and personal) of human beings. Moreover, they do so quickly. If
it is possible to obtain in the space of anywhere from 30 to 180
minutes an estimate of a person or a group of persons, which will 
at least roughly correspond to reality, it is hard to deny that the 
devices for so doing have justified their existence. 

What the clinician, or educator, or personnel worker, or voca-
tional counselor does about the test data when obtained is, of
course, another story. The evidence yielded by such material is 
only partial. In this respect psychological tests resemble the lab- 
oratory tests of the physician. He would not wish to be without 
them; but once they have been applied, he proceeds in terms of his 
knowledge of the total picture and of his general outlook and 
experience. But that the data are worth having, and within limits 
highly significant, nobody can deny. 

B. Few critics of mental testing have sought to combat such 
claims as these, for their authenticity is altogether too patent. But 
not a few have argued that tests are limited to purely practical 
and ad hoc values. Thus Thomas (q.v.) insists that they are built
on wholly unsound premises from which they should be freed and
that they should be considered simply as “engineering instruments
for purposes of evaluation” (p. 83).

It certainly seems peculiar at the very least to recommend instru- 
ments as useful, practical tools while saying at the same time that 
they are theoretically quite unsound. Indeed, the suggestion might
very well be that there must be something unsound about the
theoretical position of the critic. But the point may be laid aside
for the moment.

It is perfectly true, as there will be many occasions to observe, 
that many common assumptions in connection with tests are at 
least questionable and often untenable. Many confident assertions
have been made about heredity, environmental influences, the 
process of mental growth, and the distribution of mental abilities 
with the claim that they are logically based on test results and 
so have a valid scientific foundation. Yet quite often such asser- 
tions are indefensible. Psychological testing is relatively new. It 
has been remarkably successful. It has about it an aura of scien- 
tific precision that has led to many rash, brash, and hasty con- 
clusions that go far beyond the established facts. But to dismiss 
psychological testing as nothing more than a pragmatic bag of 
tricks is a very great mistake. It certainly does not reveal with 
any finality or completeness the nature, organization, and action 
of the human mind. Indeed, it throws considerably less light on 
these ultimate questions than a good many people were once 
inclined to suppose. But it does provide a methodology and a
growing accumulation of data that the theoretical psychologist 
cannot legitimately ignore. 

Many years of patient research lie ahead, and many lines of 
work must be pushed onward until they converge before the true 
general significance of what tests are revealing becomes apparent. 
But new techniques of analysis are emerging, one conspicuous 
instance being factor analysis. And to deny that tests have any 
theoretical significance or basis because their orientation has been
in the first instance practical and because they have not yet
uncovered the central mystery of psychology is both untenable
and unintelligent.


MENTAL TESTS AND PSYCHOLOGICAL THEORY 


Critics of mental testing, centering on the limitations which 
have been considered above, have argued that they are due to an 
unsound psychological orientation. The contention is as follows: 

Existing psychometric instruments are said to be based upon
the presuppositions of an atomistic or mechanistic psychology. Of
necessity they undertake to isolate and measure separate abilities,
such as general intelligence, interest, mechanical aptitude, socia-
bility, musical talent, and the like. There seems no other pro-
cedure, so far as can be seen at the present time. These abilities
are thought of as independent unitary functions, and the indi-
vidual human mind is, at least by implication, regarded as the sum
total of these units which exist in it in ascertainable amounts.
Also, these separate abilities or functions are thought of as operat-
ing in test situations as they do elsewhere. Thus the higher levels
of organization and action, such as originality, initiative,
inventive power, capacity for continuing and persistent learning,
and so forth, are composites of simpler and measurable elements.
The higher mental processes are admittedly important, and admit-
tedly they cannot be tested directly; but their components can
be measured directly by existing means, at least to a considerable
extent. In effect, the position is that we cannot directly measure
the characteristics which make up the mind of an Einstein
or a Raphael; but we can get a set of figures which, within the
technical limitations of our instruments, will authentically repre-
sent the powers and functioning of such a mind, because its com-
ponent abilities can be tested. Such, it is said, is the psychological
theory on which psychometric procedure depends.


This whole viewpoint, however, it is argued, is erroneous. It is
diametrically opposed to the organismic or holistic or configura-
tional psychology coming more and more into prominence. The
individual mind is precisely not a composite of unitary traits or
abilities, but a functioning unit. Intelligence, for instance, cannot
be separated from interest. What is called musical talent, or artis-
tic talent, or mechanical aptitude is not a sort of special faculty,
but is essentially the mind or personality as a whole operating in
a particular way. Moreover, it is false to claim that a person will
behave in the same manner in a test situation as he does elsewhere.
For instance, a fairly typical question which occurs in a certain
test of “practical judgment” is: What is the right thing to do if
you find you are going to be late for school? The answer that a
child makes on the test blank may have little relationship to what
he would actually do in such a situation if it really happened, for
the personality as a whole responds differently in different situa-
tions.

This entire line of argument is found expressed with different
degrees of completeness in the work of many writers, but it has
been brought together with particular effectiveness by Thomas
(q.v.). The general conclusion to be drawn is that if the criticism
holds good, then projective instruments which call for free per-
sonalized reactions to the stimuli presented might be well founded,
whereas psychometric instruments would be founded on sand. Also,
Thomas points out that the configurationalist or holistic psychol-
ogy, notably as expressed by John Dewey, has been the basis of
the progressive movement in education. So he holds that there is a
fundamental inconsistency between psychometric testing and the
whole body of doctrine and practice that goes by the name of
Progressive Education.

Up to a point the argument is convincing. Certain workers in 
the field of mental testing have at any rate seemed to express 
themselves in the language of a so-called atomistic psychology. 
They may be right or they may be wrong, but in so far as they 
have put forward a general viewpoint, it is perfectly legitimate 
to criticize them. The point is that such a viewpoint is not, as a
matter of logical necessity, the true basis of psychometric testing. 

The ultimate consideration is that our best tests—and they are 
numerous—really do work. They have validated themselves in 
actual practice, not perfectly to be sure, but quite well enough to 
be of great service. They do not tell the whole story, and those 
who make and use them should always bear this in mind. But 
they do tell an important part of it. Good intelligence tests really 
do indicate capacity for academic education, though they are less 
certain indicators of business and professional success. Medical
aptitude tests, which are essentially intelligence tests with a medi- 
cal bias, really do foretell success in medical school and during 
internship, though their relationship to a man’s success as a prac- 
ticing physician is another matter. Measures of interest are closely 
related to effectiveness and satisfaction in many vocations. The 
best measures of personality and temperament have real clinical 
value, differentiate well between stable and unstable individuals,
and are of definite service in the diagnosis of mental aberrations.
All this surely indicates that such instruments must be to some
extent at least psychologically sound. To argue, as Thomas and
many critics actually do, that they are theoretically nonsensical 
yet practically useful is a violent anomaly. 

The truth of the matter is this. The attack is made against 
views which may be premature, or extreme, or incautious, but
which are essentially irrelevant, and not the necessary basic psy-
chological theory underlying our psychometric testing. Any test
must be built about some concept of the thing to be tested. Such 
concepts as general intelligence, neurotic tendency, a prevailingly 
aesthetic outlook upon life, honesty, and mathematical ability are 
typical examples. Tests are constructed to try to measure each of 
them, and if we have no idea what our working concept is, and 
do not isolate it at all, how is it possible to build a test directed
towards it? But there is not the slightest necessity to assume that
such a concept corresponds to an ultimate or “real” component of 
the mind. It is only a guiding thread, a working hypothesis, set 
up for the sake of a job. Sometimes it seems to prove up. Some- 
times it does not. If our hypothesis yields an effective and work- 
able pattern of test items, and if the resulting test turns out to be 
useful and significant, then the concept about which it is organized 
is to that extent validated. If we really knew how the human 
mind is organized, there is no doubt that we could make much 
better tests, and perhaps they would look quite different from our 
present instruments. But this knowledge is not available, and the
best that can be hoped for is the emergence of good working 
hypotheses. 

Thomas and others have complained that the working concepts
about which tests are set up are merely of the order of “common
sense” ideas. Workers in the field of testing have taken over such
... different degrees of intelli-
gence, or artistic ability, or assertiveness, or motor coordination,
tried to define them with at least a little more precision than is
found in their ordinary use, and ...
around them. Surely, it is said, this means that psychometric in-
struments involve very superficial psychol...

... may startle the layman, but all it means is that if one learns one
thing and then proceeds to learn another, the second job of learn-
ing may obliterate the first. Psychoanalytic concepts, of course,
seem to be of a different order, but they are at least open to
considerable question and have not been assimilated into general
psychological usage. The truth is that mental testing is neither
behind nor in front of the general development of the science of
psychology. It operates at just about the prevailing general level
of psychological investigation today. It is an important technique
of demonstrated value. But those who devote themselves to it
cannot be expected to produce operating concepts enormously in
advance of our present understanding of mental life.

In all this there is nothing that requires a mechanistic psy-
chology or that need be unacceptable to adherents of an organismic
... astronomy, but they certainly
... ty of the cosmos. They do not
... ly offer guide-lines for research
... mental testing at the present
... h better than those of faculty
... in which the practical outcome was
... If critics complain that mental tests
... vant to the work in hand.

The matter may be put as follows. The testin...
effect, based on the proposition: Let us think ...
way, notably by dint of th...
Thus it appears that a ...
explained in terms of a relati...
as verbal ability, inducti...
tial thinking, and the like. And att...
struct tests which will reveal such
[...] University, Bureau of Publications [...]

Philip E. Vernon, The measurement of abilities (London: The
University of London Press, Ltd., 1940), Chapter 10, “Hints to teach[...]”
Excellent practical advice which throws light on the nature of [...]

L. W. Webb and Anna Markt Shotwell, Testing in the elementary
school (New York: Farrar and Rinehart, Inc., 1939), Chapter 7,
“Uses of tests.” Good practical material.


QUESTIONS FOR DISCUSSION

1. From an examination of sample tests assemble as many different
types of test items as you can find. Does it seem to you that any of
them might offset the limitations of testing discussed in this chapter?

2. See if you can invent some test items which might work out in
the measurement of general intelligence, mechanical aptitude, musical
ability, or any function you choose. This should give you an
insight into the nature of tests and their construction and use.

3. Make an outline of the argument presented by Thomas, particularly
in his fifth chapter. Document his criticisms of testing as
“atomistic” by references to the reading [...]

4. Does the position taken [...] and Coxe seem to any
degree to meet the criticisms summa[...]

5. Have you encountered in general [...]
[...] any criticisms of testing not mentioned by Thomas? Can you
find any replies?

6. Look over a number of sample tests. Do you think that their
titles might be in any way misleading? What questionable conclusions
might they suggest?
7. Is the fact that tests “work out” evidence in favor of the soundness
of their underlying theory?

8. What do you think of the use of memory items in intelligence
tests?

9. See if you can find out whether your own institution has any
data on file regarding the value of psychological tests for entrance,
guidance, and so forth. Collate and discuss.

10. It is suggested that if possible one or more group psychological
tests be administered in class, and the results used as a basis for discussion
in studying this and succeeding chapters.




CHAPTER II 


PSYCHOLOGICAL TESTS AS INSTRUMENTS OF 
MEASUREMENT 


THE CONDITIONS OF MEASUREMENT


Any measuring device whatsoever must fulfill four definite conditions
if it is to be of service. This is true of foot rules, balances,
thermometers, speedometers, and also of psychometric tests. The
four conditions must be met or the device will be misleading and
useless. In regard to projective tests the issue is not so clear, for
they are not considered to be devices for measurement, at least
in the ordinary sense, and we shall find that there is some difference
of opinion regarding the criteria by which they must be
judged. As to psychometric tests proper, however, the case is reasonably
clear. The values and limitations of such tests depend far
more upon these unavoidable requirements than upon any theoretical
orientation.
The four conditions in question center on the same thing: the
idea of a measuring device which really measures, which yields
authentic, dependable, serviceable results within the limits of its
applicability and meaning. There are four classes of possible error
which can afflict any instrument of measurement whatsoever. But
while it is necessary to consider them one by one, it is also important
to see that all of them are interrelated and that they affect
one another in many ways.
1. All measurement is subject to constant error. Suppose, for
instance, that the workers in a test laboratory wished to secure
curves showing the fluctuations in the temperature of an engine
under operating conditions. They would install a device to take
readings from the appropriate instrument, namely the thermometer
or thermometers placed appropriately in contact with the
machine. It might be, however, that a clumsy mechanic hooked
[...] thermometer not placed in contact [...] the temperature
[...] investigation,
the curves would have a very peculiar look. They would have
nothing at all to do with engine temperature, [...]
a very vague and undetermined relationship to it. And they would
be quite useless for purposes of investigation, or as a basis for any
kind of practical decisions.


In such a case the error is very wild and obvious. Yet errors
just as wild, but unfortunately much less obvious, are entirely
possible in the field of mental measurement. We might, for instance,
have something called an intelligence test that called for
nothing but routine memory responses, or for some very special
knowledge or skill, or that was so easy that everybody scored
100%, the only difference being the speed at which the subjects
worked and the time they took to complete the assigned task. Or
we might have something called a test of artistic talent in which
the subject was called upon to show whether he could make a freehand
drawing of a straight line.* In such instances there would
be falsification or error, due to a constant deflecting influence,
and any set of scores that might be obtained would be invalid,
which means that it would not represent the ability supposed to
be tested.


So any measuring device must be valid. This is its first necessary
characteristic. It must measure just what it purports to measure 
and so far as possible nothing else. In other words, it must measure 
without undue constant error or deflection. 


2. All measurement is subject to variable errors. These are errors
which come from accidents and inaccuracies, and they are due to [...]
[...] carpenter who has ever tried to cut a
[...] for a job of work is all too well
[...] accurately, or he may read it wrongly, [...]
place on the board he is using [...]

* This is not as impossible as it may seem. Just such tests of artistic ability
are actually given by not a few [...] teachers, although the item is not used in any
published instrument known to [...]

[...] reliability. The thing to hope for and to try for is measurement
accurate enough for the purpose in hand. It must be more accurate 
for a living-room table than for a sawhorse, and more accurate 
still if the job is the construction of an airplane engine. But it will 
never be wholly and completely exact. 

For all measurement is to some extent afflicted with variable 
error. Special instruments and devices are employed to reduce this 
variable error. Perhaps our amateur carpenter invests in a miter 
box, which costs him something, but enables him to feel a good 
deal more confidence when he starts cutting into his wood. If he 
had to get his work accurate to one thousandth or one ten thou- 
sandth of an inch, the miter box would not be much use, and he 
would have to invest in some very expensive instruments indeed. 
He would do so in the interest of accuracy or reliability, and to
escape from the “personal equation” and variable errors generally.


The same problem is present in mental measurement. A certain
test is given to a group of children, and they are ranked on the
result. Then a second equivalent form of the test is given to them,
and it is found that there is a certain amount of change in their
rankings. Which set of results is the right one? There is no way
of telling; for if the test has three forms and the third is administered,
another slightly varying set of scores will appear. However,
if the variation is only slight, it may not matter a great deal for
practical purposes; but if it turns out to be extreme, then none of
the scores are usable, and the test itself is condemned as too
unreliable for service. So a psychometric instrument must have a
serviceable degree of reliability. This is its second characteristic.
It means that variable error, which can come from many sources,
has been held down to a reasonable and workable degree.
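
The comparison of rankings on two equivalent forms can be put in numerical terms. The following is only a minimal sketch, not a procedure taken from the text: it assumes that agreement between the two rankings is summarized by a rank correlation (Spearman's coefficient), and the scores, the variable names, and the helper functions `ranks`, `pearson`, and `rank_agreement` are all hypothetical.

```python
def ranks(scores):
    """Return ranks (highest score = rank 1.0), averaging the ranks of ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Product-moment correlation; returns 0.0 if either list has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def rank_agreement(form_a, form_b):
    """Spearman rank correlation between scores on two forms of a test."""
    return pearson(ranks(form_a), ranks(form_b))


# Hypothetical scores of the same six children on two equivalent forms.
form_a = [52, 47, 60, 38, 55, 41]
form_b = [50, 49, 58, 40, 51, 39]

print(round(rank_agreement(form_a, form_b), 3))
```

A coefficient near 1 corresponds to the case in which the variation "is only slight"; a coefficient near 0 corresponds to rankings so unstable that none of the scores are usable. Where the line between the two should be drawn depends on the purpose in hand and is not fixed by this sketch.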


3. All measurement is subject to personal errors. These, of
course, are a type or subclass of chance, or variable, or accidental
errors, but they are important enough to be considered by themselves,
at least in connection with psychology.

To return to our amateur carpenter for an illustration, he may 
be on the job of cutting a fourth table leg to match the other 
three. He lays his rule on the wood and makes his markings. But 
he is tired, or shaky, or bored, or something distracts his attention,
and the result is an error. An element of subjectivity has vitiated
the measurement, due to the condition of the person who is making
it. Our carpenter can reduce the chances of such personal errors
by using guides and a power saw. Then he is very likely to get all
four table legs just about the same length. (They will not be abso- 
lutely so because of variations in the instrument, but they will be 
close enough.) Or if he lacks such equipment, he can get somebody 
else to check up on his measurements. In both cases he will secure 
a higher objectivity. 

In mental measurement the chances of personal error are very
much greater than those with which the carpenter has to contend,
for one thing because there is quite apt to be a strong element of
bias. Thus teachers are apt to overestimate the intelligence of
dull children, and to underestimate the intelligence of bright ones,
because dull children are usually overage and large for their grade,
whereas bright ones are usually underage and small. Also, of
course, personal feelings, liking and disliking, prejudice and
favoritism constantly disturb estimates. And when it comes to
assessing a person’s attitudes towards such matters as race or
war, or his neurotic tendencies, personal errors are always a major
threat. Thus good mental tests must be constructed to embody the
same principle as the guides that help the carpenter. They must
be devised to guard as far as possible against besetting personal
errors. In other words, they must be made objective. This is the
third of their basic characteristics.

4. All measurement is subject to errors of interpretation. Consider,
for example, a map of the world in Mercator projection.
Unless one is careful, it can lead to most misleading notions, because
the units marked off by the lines of latitude and longitude
change as one goes from the equator to the poles. Australia looks
much smaller than the United States, yet really it is just about the
same size. Greenland looks much farther from the North American
mainland than it actually is. Air distances [...] are distorted.

The selfsame difficulty [...] measurement. And it exists in [...]
many people are not aware of it and because the precise formula
for correction is not [...]
Mercator projection [...]
[...] score of 100 double the score of [...]
in terms of what it really means? The second test yields the score
of 132 among others. Has this the same meaning as the score of
132 on the first test? Questions of this kind constantly arise in the
application and use of mental tests, just as they would do if we
were making estimates of distance on a Mercator projection map
of the world and on a globe.

Now the Mercator projection map is standardized on a known
principle. That is to say, its units have a definite and ascertainable
meaning so that if any one is misled, it is entirely his own fault. So,
too, there must be some way of assigning to the units of measurement
set up in any psychological test a known and constant significance.
They need not be equal, though it is more convenient if
they are. That is to say, the difference between a score of 80 and
90 need not be exactly the same as that between a score of 120 and
another of 130, or the difference between a mental age of eight
and one of nine need not be equal to that between a mental age of
twelve and one of thirteen. But the nature of the variation, if
there is one, must be understood. This is just the situation with
the Mercator projection map. Variation is less convenient and
more risky than equality and is more apt to deceive the unwary.
But so long as we know how to allow for it, there is no fatal objection.
Clearly, however, the meaning of the units of measurement
set up in whatever instrument may be under consideration must
be known, or it is unusable. That is to say, the instrument must be
standardized. This is the fourth of its necessary characteristics.
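
The question whether a score of 132 means the same thing on two different tests can be illustrated numerically. What follows is a sketch under stated assumptions, not the method of any particular test: it assumes that each test has been tried on a norm group, and that a raw score is given meaning by its distance from the norm-group mean in standard-deviation units. The norm-group scores, the function name `standard_score`, and the reporting scale (mean 100, standard deviation 15) are hypothetical choices made only for the illustration.

```python
from statistics import mean, pstdev


def standard_score(raw, norm_scores, scale_mean=100.0, scale_sd=15.0):
    """Express a raw score relative to a norm group on a common scale.

    The raw score is first converted to a z-score (distance from the norm
    mean in standard-deviation units) and then placed on an arbitrary
    reporting scale. The scale constants are illustrative assumptions.
    """
    z = (raw - mean(norm_scores)) / pstdev(norm_scores)
    return scale_mean + scale_sd * z


# Hypothetical norm groups: same average, very different spread.
test_one_norms = [90, 100, 110, 120, 130, 140, 150]
test_two_norms = [60, 80, 100, 120, 140, 160, 180]

# The same raw score of 132 on each test.
print(standard_score(132, test_one_norms))
print(standard_score(132, test_two_norms))
```

With these invented norms the same raw score of 132 lies 0.6 standard deviations above the mean on the first test and only 0.3 on the second, so the two scores do not carry the same meaning until the units of each test are known.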

It is important to understand that the four necessary characteristics
of any measuring instrument are not independent, so that
validity, reliability, objectivity, and standardization are interlocking
aspects of the same thing, namely, the process of accurate and
serviceable measurement, which avoids gross errors of all kinds.

The balance of this chapter is devoted to considering in some
detail the major technical problems involved in these four aspects
of measurement. This furnishes a necessary background for the
discussion of numerous tests and testing procedures presented in
chapters 3 to 8 inclusive. Then, after this body of concrete material
has been set forth, we shall return once more to the [...], particularly
in the two final chapters of this book, and deal with some
of the wider issues involved in the concepts of validity, reliability,
objectivity, and standardization, and in the theory of measurement
generally. This movement from more specific, and [...], and
technical considerations to those that are more general commends
itself as a means for gaining a grasp of the logic of measurement,
its implications, possibilities, and limitations. But as the
subject is approached, the first thing to see is that the above four
characteristics are interrelated, and that they are essential, since
without them there could be no such thing as measurement. Essentially,
the whole modern testing movement has turned on the
discovery of ways of dealing with mental processes by instruments
reasonably valid, reliable, and objective, and fairly well
standardized. It is the fulfillment of these four conditions far more
than any general theoretical basis which accounts for both the
values and the limitations of mental tests.


VALIDITY 


[...] an actual
instance, the construction of the well-known Henmon-Nelson Test
[...] many variations.
[...] phases of the process.
1. A working concept of the function or process to be tested
must be set up. As we have seen, Henmon and Nelson did not find
it necessary to undertake a novel or elaborate specification of the
nature of “general mental ability,” but a certain working idea was
necessarily involved in all they did. In a case like this the idea
would be implicit in what one might call the tradition of making
general intelligence or general ability tests. Other authors, particularly
when they are originating some departure from standard
practice, have undertaken to define their working hypothesis quite
carefully. Thus Binet considered general intelligence as turning on
three characteristics: the power to take and maintain a definite
direction in thinking, the power to make adaptations in order to
attain a goal, and the power of self-criticism (Terman, 1916, p. 45).
Thorndike, in building his I.E.R. Intelligence Scale CAVD, thought
of intelligence as having four attributes. These were as follows:
(a) level, or altitude, i.e., the degree of difficulty that a person
can reach in the performance of mental tasks; (b) range, i.e., the
number of different tasks one can perform at a given level of difficulty;
(c) area, i.e., the number of different tasks one can perform
at all levels of difficulty one is capable of reaching; (d) speed.
He considered altitude by far the most important of the four, and
proceeded to build his scale with reference to it. The scale in
question consists of a series of graded tasks of four kinds (completion,
arithmetic, vocabulary, and directions), beginning at a
very easy level and continuing to a point which can be attained
by very few human beings (v. Thorndike and Others, 1927). To
pass to another field, Voelker (q.v.) proposed to construct a test
to measure the trait of “trustworthiness.” Evidently he had to
define this trait in order to begin assembling relevant test items.
For his purposes he regarded trustworthiness as a tendency to
abide by instructions without supervision or checkup. These are
typical instances of the first phase of test construction; namely,
the isolation and definition of a concept which it is hoped will
prove effective.

It will be illuminating to understand that the first step
[...] is also in effect the first in the techniques
of direct observation now widely used in child study. The
procedure is essentially that of watching the child or group of
children in action under normal conditions. The observation may
be continuous or for a stated series of brief periods. It is desired,
let us say, to observe a given child for friendliness, or self-help,
or some other trait which will show up in his behavior. And the
first step is to clarify the concept and to define its meaning when
it expresses itself in observable action. Without this, observation
lacks orientation and is fruitless (Olson and Cunningham).

This phase of the process of making a valid test has been receiving
increasing emphasis in recent years, due to the wide use of the
techniques of factor analysis. Factor analysis, as we have seen, is
a statistical procedure (or array of procedures) designed to produce
tests which measure certain designated mental factors and
nothing else. Thus the California Test of Mental Maturity is a
battery intended to measure immediate memory, delayed memory,
spatial thinking, logical reasoning, numerical reasoning, etc. In the
manual the authors say, in effect, that its validity turns primarily
on its capacity to measure precisely these defined and specific
factors. In other words, they rest their case, first and foremost,
upon a very careful analysis and defining of the basic concepts
underlying the instrument, this analysis being achieved not by
mere conjecture, but by a closely controlled statistical technique.
Validity as so interpreted has been called “factorial validity”
(Guilford, 1946), i.e., the ability of a test to reveal some specific
and designated component of mentality without any irrelevance.
There is even a tendency at the present time to regard factorial
validity as sufficient in and of itself, but this is open to very grave
question.

All this raises a fundamental question which has already been
considered in a somewhat different connection. Do the traits or
factors we propose to observe or to test really exist? Do our working
concepts correspond to actual psychological entities? McCall
states the issue in a basic proposition which asserts that whatever
exists can be measured. By this he means that no matter how
refined, or spiritual, or exalted, or evanescent a mental function
may be, if it exists at all, it must exist in some quantity; and if it
exists in some quantity, there is always, in theory at least, the
possibility of ascertaining just what that quantity is. General
intelligence and trustworthiness, which have already been mentioned,
would be two illustrations of what this proposition means.
But it refers just as much to honesty, aesthetic appreciation, creative
power, unselfishness, sociability, spatial thinking, logical
reasoning, and indeed to the whole vast range of functions of which
the human mind is capable.


One might, of course, object that it is one thing to admit the
theoretical measurability of any given trait or factor and a very
different thing to carry through the actual job of measuring it.
But the relevant point here is somewhat different. We try to measure
or to observe many kinds of traits and functions and factors.
But do they actually exist? Are they actual mental and behavioral
entities? We see a child perform certain actions when he is with a
group of other children and say that he manifests friendliness. We
find that he makes a certain score on a vocabulary test and claim
that this indicates a certain degree of general intelligence. We
find that he is able to manipulate various spatial designs in a test
situation, and claim that this indicates a certain capacity for spatial
understanding or thinking. So with all other such traits and
functions. Must we, then, believe that these traits, and functions,
and factors have a “real” or metaphysical existence? The assumption
is certainly not necessary from the standpoint of test construction
and validation. The only necessary hypothesis is that the
behavior which observation records or the test score registers is
indicative of how the individual will behave elsewhere and under
other circumstances. When it is said that a child has proved himself
friendly in terms of certain observed actions, the implicit
assumption is that he will prove himself friendly under other
circumstances and in other types of action. When his high vocabulary
score is taken as a sign of high general intelligence, the statement
really means that he will deal intelligently with quite different
problematic situations. When he succeeds in manipulating a
set of geometrical designs, and thus scores well on “spatial relationships,”
this amounts to saying that he will be competent in
handling spatial problems elsewhere. In so far as the working
concepts which direct observation and testing are well chosen,
and in so far as they are translated into relevant ways of behaving
or relevant test items, this assumption is likely to prove true. This,
however, does not necessarily mean that the concept corresponds
to some real entity, but only that the sample of behavior that it
isolates (the observed actions, or the test responses) is significant;
i.e., symptomatic of the behavior of this individual in other circumstances.
2. The second phase in the building of a valid instrument of
mental measurement is to assemble and select test items which in
the experience and judgment of the maker are likely to involve the
trait, or characteristic, or function as conceived. That is to say, the
working concept must be translated into a series of specific
responses elicited by properly chosen stimuli.

This, of course, is a vital requirement, and it is not always any
too well fulfilled. Thus many of the items in the Binet scale and
its various revisions seem to correspond pretty well with the nature
of intelligence as he conceived and described it. But others do not.
The vocabulary test, for instance, which is always regarded as
very important and valuable, would appear to depend a good deal
more on past experience than upon the ability to follow a line of
thinking, to adapt oneself for the sake of a goal, and to criticize
one’s own endeavors. In the same way, such items as the repetition
of digits and of syllables certainly seem remote from intelligence
as conceived and defined. Turning to the Thorndike I.E.R.
Intelligence Scale CAVD, this is a test which stays very close
indeed to its stated working conception, for it is expressly built to
measure altitude of intellect by a graded series of tasks. In
regard to Voelker’s attempt to construct a test of trustworthiness,
which has also been referred to, he devised many interesting and
ingenious items to ascertain whether a subject would be likely to
abide by instructions without supervision. Some of them consisted
of deliberate overstatements, such as asking a boy whether he had
received a mark of 95 in arithmetic. Another consisted of requiring
the subject to push a button every two minutes for a given period
of time while a record was made to see if he obeyed. In yet another,
the subject was given a page of arithmetic problems on the
reverse side of which appeared what purported to be the answers.
Some of these answers, however, were wrong, so that it was possible
to tell whether any copying was done.

It is usual in the construction of a test to bring together a much
larger pool of items than can be used. The maker’s own judgment
and experience begin the work of refinement and selection. Then
other capable and experienced persons are called in to carry the
work further. In the case of the Henmon-Nelson test, the help of
a group of experienced teachers was secured in deciding which
items seemed most suitable. In the case of the I.E.R. Intelligence
Scale CAVD and of the two Stanford Revisions of the Binet scale,
the working staff who collaborated in the test construction constituted
in effect a jury.

It is usually at about this point that technical statistical analysis
begins to be introduced. The tentatively selected test items are
tried out, often on quite large groups of persons similar to those
for whom the completed test is intended. Items tend to be selected
which give the highest possible correlation with total scores on
the tentative test. The point of this is to be as certain as possible
that the instrument will be internally consistent and that it will not
contain material irrelevant to itself. Another criterion used is the
discriminative power of the items. The trial subjects are rated by
some external criterion, such as estimates by teachers, record of
schoolwork, or other similar existing tests, on the trait or function
to be measured. Items that lead to similar responses on the part of
superior and inferior subjects are eliminated, and only those that
show marked differentiation are retained. Thus in the construction
and revision of the Binet scale, a criterion constantly used for the
retention of subtests was that they should show improved scores
when given to children in higher age groups. Yet another method
coming prominently into use is factor analysis. Tryout tests are
given. Then an analysis is made of the scores and their interrelationships
which shows that they are explainable as due to the
operation of a limited number of more or less well-defined mental
factors. Then the items are rearranged and the tests reorganized
so that each test in the battery measures one and only one such
factor.

These are standard and accepted methods for getting a suitable
set of test items for the measurement of some trait or function.
However, the actual choice of items is much influenced by considerations
that are not entirely explicit. Only those stimuli which will
yield scorable responses can be used. It would be useless to try to
test a person for such a trait as nervousness by showing him a
horror motion picture and noting his reactions. Also, it would be
useless [...] mental ability by presenting the subject
with a [...] of historical or scientific data and asking him
to [...] conclusions. In both these situations the
response elicited might be admirably indicative, but it would be
quite [...]
3. The [...] stage in dealing with validation is to
check the completed test against outside criteria, so that the
maker can determine more or less certainly what he [...].
This has been called “practical” as contrasted with factorial
validity (Guilford, 1946), yet on any basis of scientific soundness
[...] an essential part of the process. No scientific
hypothesis can be considered [...] on the
basis of internal logical consistency, and unless it has been checked
against facts. This certainly applies to the determining concepts
around which tests are built. So validation against outside criteria
is altogether necessary if we are to have tests on which confidence
can be placed. Yet, as will appear, this external or “practical”
validation is a good deal less decisive than might be expected.

A. One frequently used criterion is that afforded by other tests
of similar order. This may be considered a minimum requirement
in proper test construction and evaluation. Henmon and Nelson,
in comparing their completed test with five others, report correlations
of .68 to .77 for groups of subjects consisting of college
students, correlations of .77 to .88 for high school groups, and
correlations of .54 to .90 for elementary school groups. Representative
correlations between four well-known group tests are
presented in Table 4. Quite often one of the revisions of the Binet
scale is utilized as the chief external criterion for validation.


TABLE 4


CORRELATIONS OF FOUR GROUP TESTS WITH ONE ANOTHER AND WITH
TWO OTHER CRITERIA, FOR FRESHMAN HIGH SCHOOL STUDENTS


(Quoted from C. C. Ross, Table 3, p. 77)


                              Detroit    Kuhlmann-   Terman   McCall
                              Advanced   Anderson    Group    Multi-Mental

Detroit Advanced ..........     ...        .88        .86       .81
Kuhlmann-Anderson .........     .88        ...        .86       .?7
Terman Group ..............     .86        .86        ...       .78
McCall Multi-Mental .......     .81        .?7        .78       ...
Average other 3 tests .....     [?]        .88        .90       .83
Average Freshman Grades ...     .46        .33        .56       .42
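
A table of this kind can be built from raw scores once every subject has taken every test. The sketch below assumes that the coefficients are ordinary product-moment correlations, which the source of the table does not state here; the scores, the test names, and the function `correlation_table` are hypothetical and are not data from Table 4.

```python
def pearson(x, y):
    """Product-moment correlation; returns 0.0 if either list has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def correlation_table(scores):
    """Correlate every measure with every other measure.

    `scores` maps a measure name to a list of scores, with the same
    subjects in the same order under every name.
    """
    names = list(scores)
    return {
        a: {b: pearson(scores[a], scores[b]) for b in names if b != a}
        for a in names
    }


# Hypothetical scores for five students on two tests and in school marks.
scores = {
    "new_test": [10, 12, 15, 18, 20],
    "older_test": [22, 23, 31, 35, 41],
    "grades": [70, 65, 80, 85, 78],
}

table = correlation_table(scores)
for name, row in table.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

The coefficient between the two tests corresponds to the upper rows of such a table, and the coefficient of each test with grades corresponds to the bottom row. With only five invented subjects the figures have no evidential value and serve only to show the arithmetic.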


As a validation procedure, this clearly leaves many open questions.
(a) It shows in general how well the new test conforms to
accepted standards in the area concerned. But if this conformity
is not very great, there is always the possibility that it may be due
to the superiority of the new instrument. (b) Much of the agreement
which is indicated may be due to a strong family resemblance
among the kinds of items used in tests of the same general
category. If one inspects a number of tests of intelligence, personality,
interest, and attitude, it is not hard to find prima facie
reasons why they will probably agree pretty well. (c) A fairly
high degree of agreement no doubt indicates that the new test is
based on pretty much the same working conception as those against
which it is checked. But if it is desirable to improve, refine, and
better delimit basic working conceptions, this may not be without
its disadvantages. So in general, the procedure leads towards uniformity
and safety, but not towards the experimental betterment
of instruments of mental measurement.

B. Quite often a test is checked against other tests which purport
to measure different functions. If there is a high level of
agreement, this indicates that the difference is apparent and
nominal rather than real. Thus it has been found that certain tests
of interest show quite a close relationship with inclusive batteries
of educational achievement tests. The conclusion is that in spite
of the difference in name and in content, what is being measured by
the two types of instruments is to a considerable extent the same.
Then it becomes a matter of convenience and expediency which to
use in a given situation. On the other hand, what is often found
is marked disagreement. Tests of mechanical aptitude and talent
tests, such as the Seashore Measures of Musical Talent, usually
show very low although positive correlations with scores on tests
of general intelligence. This, of course, means that the two instruments
being compared are measuring widely different functions.
It is, however, necessary to be very careful in putting forward
interpretations of such findings. To say that musical talent or
mechanical aptitude have nothing to do with intelligence would
not be admissible, for the tests in question, like all others, are
simply based on working concepts which may or may not correspond
to psychological realities, and indeed probably correspond
to them quite vaguely at the best.


This validation procedure may be regarded as a practical winnowing
device. It enables us to group and subdivide tests as dealing
with similar or dissimilar functions. Such knowledge has considerable
importance in planning a testing program in which it is
wanted to get as wide-ranging a sample of mentality as can be
had. But it can hardly be taken as proving anything about the
organization of the mind, or the true relationship of human
abilities.

C. Very frequently tests are validated against achievement in
school. Here the most widely used criterion is that afforded by
school marks. Thus Jordan (q.v.) reported the following correlations
of test scores against high school average marks: Army Intelligence
Examination Alpha .38, Otis Self-Administering Test of
Mental Ability .49, Terman Group Intelligence Test .47. Henmon
and Nelson, in their validation procedure, found a correlation of
.60 for their test against composite high school marks. The
relationship between the Medical Aptitude Test and marks in medical
college has been reported in Table 3, p. 20.

As a validation criterion, school marks obviously leave a great
deal to be desired. Even a composite mark has considerable
unreliability. And as an average it is made up of components usually
unspecified, and each with a weighting which is not reported. The
same composite mark would presumably mean two quite different
things if it were made up of separate marks in shopwork, music,
English, and typewriting, or of separate marks in algebra, geometry,
trigonometry, Latin, and Greek. The reason why the criterion
is so widely used is chiefly that it is about the only readily
available numerical rating to be obtained on large numbers of persons.
Another and by no means negligible reason is that the widest use
of tests is in educational guidance, and if they do not indicate and
foretell school achievement expressed in terms of marks, their
purpose will be defeated.

A secondary criterion quite often employed is that furnished by
teachers’ ratings. It has been frequently used in connection with
the validation of intelligence tests. Here it is open to serious
objection. An intelligence test is supposed to provide a better
indication [of intelligence] than can be provided by estimates, [and]
to use the latter for proving up the former seems [like] arguing
in a vicious circle. With regard to tests of other kinds the case
is somewhat different. Thus the Stenquist Measures of Mechanical
Aptitude were checked against ratings by teachers of manual
training and science. Many tests dealing with conduct, attitude,
and personality have been checked against the opinions of teachers
regarding the behavior of groups of subjects. There [is] a very
good chance that such ratings [are sound]. The functions are
reasonably definite and observable. The teachers are often
specialists. They are likely to have fairly intimate contact with
their pupils. It is quite believable that their opinions about
the mechanical competence and general behavior of these pupils
are reasonably accurate.

D. [Various] out-of-school criteria have been used for purposes
of validation. One quite frequently used is vocational success.
In connection with intelligence tests this criterion is not much
employed. It seems promising, but as a matter of fact it is decidedly
treacherous. Just what is meant by success? Can it be registered
in terms of salary? Does a salary of $5,000 indicate the same level
of success in the Methodist ministry and in banking? How can
success in different vocations be compared? Is success an indication
of genuine competence, and particularly of competence in
some specified mental characteristic? Problems of this kind are at
least very difficult if not insuperable, and the result has been that
this criterion has proved almost unworkable for most types of tests
(v. Bin[gham], pp. 229-35). One good instance of vocational validation
is provided by the Strong Vocational Interest Blank for
Men. It has been shown that those who succeed or continue in a
given vocation are much more likely to show a strong interest in
it than are persons in general.

[For] tests of personality traits, and also for some intelligence
tests, [commitment] to institutions for the feeble-minded or for the
[insane] is used as a validation criterion. So also is later psychiatric
[diagnosis]. These are probably about the best criteria we possess,
[because they] involve carefully considered expert opinion and
responsible action. As will be pointed out later, the best personality
tests [stand] up fairly well in this respect. This seems to
[indicate] that they are built around well-chosen concepts and
that the concepts are translated effectively into really indicative
items. As to intelligence tests, the use of such criteria has shown
that the two revisions of the Binet scale are inferior as clinical
and diagnostic instruments to certain other individual scales,
notably those of Wechsler and Kuhlmann. This, however, does not
mean that the Stanford revisions may not be superior in other
respects.

E. A number of recent developments, notable among which
was the wide use of mental tests in World War II, have called
increasing attention to the problems and techniques of validation.
The most important single question about any test must always
be the question of its validity, but when the selection of personnel for
different types of war service was involved, the matter became very
[serious]. Psychologists were very definitely challenged to produce
instruments of mental measurement which would definitely justify
themselves and yield indubitable results (v. Staff, Personnel Research
Section, Classification and Replacement Branch, the
Adjutant General’s Office, 1943 c). J. G. Jenkins (1946 b, p. 93) has
summarized the effect of this situation as follows: “The events of
World War I taught American psychologists the necessity of
validation. The experience of the next two decades taught them
much about the technique of validation. It remained for World
War II to drive home to psychologists the necessity for devoting
much time and thought to the basis of validation.” In passing, it is
interesting to remark that while mental tests were widely used in
the German Airforce, the problem of their validity was largely
ignored. Apparently German psychologists relied upon what is
often called “face validity,” which is usually understood to mean
prima facie or “common-sense” validation, without any carefully
controlled checking or investigation (v. Mosier, 1947).
The two most important considerations which have emerged
largely as a result of war experience are, first, the need for a careful
analysis of the criterion, and, second, the need to determine the
degree of validity required if a test is to be usable. (a) First, it is
recognized that the criterion must be analyzed. The general criterion
of “success” can be extremely misleading. We propose,
perhaps, to set up a test to select competent typists. If it is to be
validated against success in typing, however, we must remember
that one typist may be fast but inaccurate, another slow but
accurate, etc., so that there are many kinds of “success.” The only
way to meet this situation is to construct a test which will yield a
profile of the various factors of success and failure in typewriting.
[For insta]nce, the number of hits scored by an aircraft [...]
under training conditions may seem a very good indication
[...], but as a matter of fact this [...]
to trainees on check flights by experienced instructors showed
almost no agreement at all. And in psychological experience in the war
it very frequently appeared that a test which would predict success
in training had very little relationship [to] combat
performance (Jenkins, 1946). So it is [now recognized that] grades
during training, ratings by instructors, and even output on the job
are all dubious as criteria unless they have been very carefully
analyzed (Stuit). (b) The second question that has become prominent
has to do with the degree or level of validity that must be
present. In place of the general statements that used to be accepted,
it is now recognized that this depends largely upon the use to
which the test is to be put. Ordinarily, it would be said that a
correlation of .50 against an established criterion indicated a low
validity, and raised doubts about the usability of the test, at least
without qualifications and other indices. But if the problem were
to pick out the top 30% of prospects for some job or assignment,
a correlation of .50 with the criterion would yield 74% of good
choices (Taylor and Russell). In other words, one cannot make
general statements about the required validity of a test. One can
only say that the more selective the decisions to be based upon it,
the lower the correlations can safely be, and conversely.*
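
The figure of 74% of good choices can be checked by direct
calculation. The following is only a sketch and not the procedure
of Taylor and Russell themselves. It assumes that test score and
criterion follow a bivariate normal distribution, and that half of all
unselected prospects would prove satisfactory (a base rate of .50,
which is not stated in the text but which reproduces the quoted
figure). The function name `proportion_successful` and its
parameters are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def proportion_successful(r, selection_ratio, base_rate=0.5, steps=20000):
    """Share of selected persons who succeed, assuming bivariate normality.

    r               -- correlation between test and criterion (validity)
    selection_ratio -- fraction of prospects chosen (top scorers on the test)
    base_rate       -- fraction who would succeed with no selection (assumed)
    """
    x_cut = STD_NORMAL.inv_cdf(1.0 - selection_ratio)  # cutoff on the test
    y_cut = STD_NORMAL.inv_cdf(1.0 - base_rate)        # cutoff on the criterion
    spread = (1.0 - r * r) ** 0.5  # SD of the criterion at a fixed test score

    # Midpoint-rule integration of
    #   density(test = x) * P(success | test = x)
    # over test scores from the cutoff up to 8 standard deviations.
    upper = 8.0
    width = (upper - x_cut) / steps
    total = 0.0
    for i in range(steps):
        x = x_cut + (i + 0.5) * width
        p_success = 1.0 - STD_NORMAL.cdf((y_cut - r * x) / spread)
        total += STD_NORMAL.pdf(x) * p_success * width
    return total / selection_ratio


if __name__ == "__main__":
    for ratio in (0.10, 0.30, 0.50, 0.90):
        print(ratio, round(proportion_successful(0.50, ratio), 2))
```

Under these assumptions a validity of .50 with the top 30% selected
gives a value close to the 74% quoted, and the proportion of good
choices falls steadily as the selection ratio is relaxed.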

Such, in schematic outline, are the methods used in establishing
and determining the validity of psychological tests; that is to say,
in reducing the constant errors of the instrument. Clearly, such
procedures are all of them trial-and-error experimental processes
leading to few absolutely clean-cut opinions. As a matter of fact,
the ultimate validation of any test is to be found only in its wide
and serviceable use. The basic conceptions are never perfectly
clear. The test items are never perfectly relevant, nor can they
reveal all the significant aspects of the function in question. The
criterion is never wholly assured. This means that constant errors
can never be wholly eliminated. As we have seen, the question has
been raised whether tests, as we know them, can ever have any
validity at all for the reason that mental traits and functions cannot
be isolated. The answer clearly is that the process of analysis
upon which any test is necessarily built is far from perfect but is
significant. For, taking the picture as a whole, our best tests have
established at least a working and serviceable validity. And if
everything in the science of psychology except what was susceptible
of precise definition and indubitable proof were thrown out, how
much would remain?


* Taylor and Russell have published tables showing the precise relationship.

“[When] unchanging subjects are [measured by a perfectly] reliable
instrument by a perfectly reliable agent, [the correlation]
between the two sets of scores is 1.00” (Walker, p. 265). This
[statement] briefly summarizes what is meant by reliability of
measurement. Each separate application of the measuring instrument
yields the same result. Each single set of measures can be
completely relied upon. Variable, i.e., accidental errors are
eliminated. Such a condition is not attained even by the finest of
physical instruments. Astronomical observatories, laboratories
using the most advanced devices, factories manufacturing the most
accurate appliances never achieve this final perfection. Variable
error is always present to some degree, and allowance for it must
always be made. This is much more conspicuously the case with
psychometric instruments. But if there is to be measurement at
all, the ideal must be approached to a reasonable degree. The
tolerances must be fine enough for the practical purposes desired.
Thus psychological tests are constructed to attain the best possible
reliability, or at least a reliability sufficient for the ends to be
served.

In order to understand how this is done and what it involves,
three points must be considered. They are first, the major causes
of unreliability in mental testing and their avoidance; second, the
methods used to ascertain the actual reliability of psychometric
instruments; third, what can be accepted as sufficient reliability
to make a test serviceable. As with validation, it will be found
that the whole subject is less conclusive than is often supposed
or than cursory and elementary accounts frequently suggest.


1. Causes of unreliability and their avoidance

The causes of unreliability may be classified as those which are
in the test itself, those which are in the person who takes it, and
those which are in the person who gives it (Symonds, 19[...];
Walker, p. 258).

A. First consider those causes of unreliability which are in the
test.

(a) Other things being equal, an increase in the length of a test
will increase its reliability. The reason is that the addition of more
and more items yields a better sampling of the subject’s true ability
and gives him a better chance to show what he is able to do. A
person may do unusually well or unusually badly with a very few
items, but if there are a great many of them, such variable errors
tend to cancel out.

However, reliability does not increase in direct proportion to
the increasing length of the test. For instance, if the test is made
twice as long, its reliability is not doubled. The formula for
computing the effect of increased length is as follows:

$$ r_n = \frac{N r}{1 + (N - 1)\, r} $$

where $r_n$ is the new reliability coefficient, $r$ the original reliability
coefficient, and $N$ the multiple by which the length is increased.
So if the length is doubled and the original reliability is .70, the
new reliability becomes:

$$ r_n = \frac{2 \times .70}{1 + .70} = \frac{1.40}{1.70} = .82 $$

Or if the length is tripled, the new reliability becomes:

$$ r_n = \frac{3 \times .70}{1 + (2 \times .70)} = \frac{2.10}{2.40} = .88 $$

Therefore, increase in length has rapidly diminishing influence
on reliability. Clearly then, on a balance of all the factors
concerned, there is always a question as to whether a great increase
in the length of a test to secure a small increment of reliability is
worth while.
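
The diminishing return from added length is easily seen by working
the formula for several multiples. The sketch below applies the
relation r_n = Nr / (1 + (N - 1)r) to an original reliability of .70,
as in the instances just given; the function name `spearman_brown`
is merely a convenient label chosen here.

```python
def spearman_brown(r, n):
    """Predicted reliability when a test of reliability r is made n times as long."""
    return (n * r) / (1.0 + (n - 1.0) * r)


if __name__ == "__main__":
    original = 0.70
    previous = original
    for multiple in (2, 3, 4, 5):
        lengthened = spearman_brown(original, multiple)
        # Show the new coefficient and the gain over the preceding length.
        print(multiple, round(lengthened, 3), round(lengthened - previous, 3))
        previous = lengthened
```

Each further multiple of length adds less to the coefficient than the
one before it, which is the point of the argument above.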
(b) Irrelevant or disturbing elements in a test tend to lower its
reliability. It is because reliability depends to an important degree
upon the pertinence of the contents of a test, upon their relevance
to what is to be measured, that mere increase in length
cannot be considered independently as a favorable factor, and
must be thought of in relation to other matters as well. The following
are among the most important causes of irrelevance or disturbance
that occur in psychological tests.
(i) Wide range of difficulty. Other things being equal, a narrow
range of difficulty tends to increase reliability. When much of the
[content] of a test is too easy and much of it is too difficult, many
of the items do not evoke significant and revealing responses, and
are thus irrelevant to the task of measurement for which it is
intended. (ii) Scaling. Other things being equal, a test with items
arranged in ascending order of difficulty will be more reliable than
one with items in random order, or with the most difficult ones
first. The reason is that items in ascending order of difficulty tend
to evoke a more consistent and significant response, since the subject
is not disturbed by abrupt and irrelevant frustrations. Some
of our best tests are constructed in this way, for instance, the
I.E.R. Intelligence Scale CAVD. (iii) Item independence.
Interdependent items, i.e., those which present the same problem in
different forms, tend to lower reliability. The reason is that the
form of the item tends to affect the response, and this is essentially
an irrelevance. (iv) The incidence of chance. The use of
items which involve important elements of chance tends to lower
reliability. Thus the two-choice item, of which the familiar
true-false item is an instance, and which involves a theoretical 50%
chance, is the least reliable form item by item. Four- or five-choice
items, in which the element of chance is greatly reduced, are more
reliable item for item. For this reason it is standard practice to
increase the total number of items where the two-choice type is
used to compensate for item unreliability. It is best not to make
correction for chance by right minus wrong scoring, to have at
least fifty such items, and to combine them with items of other
types. (v) Catch questions tend to lower reliability, because they
introduce an obviously irrelevant element. (vi) Emotionally loaded
items decrease reliability. For instance, if a test contains items on
race problems, it may be less reliable in a community with strong
racial prejudices than when they are absent. Again, in some tests
unpleasant, or disturbing, or ill-printed, or unintentionally funny
pictures may be used, which clearly involve irrelevance and
disturbance.
The above may be considered the chief empirical elements of
irrelevance that often lower test reliability. The techniques of
factor analysis, however, go beyond these matters. Their purpose
is to produce tests that are “factorially pure,” i.e., entirely [...]
Test in comparison with the Otis Self-Administering Test of Mental
Ability, of which it is an adaptation. The Otis Test [...]

B. [Causes of unreliability in the person who takes the test],
i.e., the subject, are as follows.
(a) The more common and familiar the subject finds the experiences
which make up the content of the test, the greater the reliability
is likely to be, other things being equal. It is often suggested
that mental processes, and more particularly general intelligence,
ought to be tested by means of items quite beyond the range of
normal experience for the sake of avoiding the influence of the
environment and for the isolation of hereditary factors. Such a
test, however, would be inconceivable, and it is much better frankly
to use items turning upon common experience on the assumption
that they will reveal true differences. This consideration is of
great importance when one proposes to administer a standard test
to members of a racial group for whom it was not specifically
planned. The unfamiliarity of the items may greatly lower the
reliability of the test. To cite another instance, Army Intelligence
Examination Alpha, developed and used during World War I,
came under serious criticism when it was given to civilians, because
it drew rather heavily on special military experience and
knowledge.
(b) The mental set of the subject is always highly important.
Any uncontrolled variations of mood and attitude at once lower
the reliability of the measurement. Thus it is found that a test run
late in the school year is apt to yield more reliable scores than it
would if run soon after school opened in the fall, particularly with
the younger pupils. The reason, of course, is that as time goes on
the children achieve a balanced orientation and are able to accept
the test situation without being disconcerted, or amused, or
annoyed, or otherwise disturbed. There is a special problem here in
connection with tests given to preschool children and infants who
are unused to such experiences. The scores may have too low a
reliability for any use because the negativism and shyness of the
subject disturb the testing. Willingness to cooperate and the
avoidance of distractions are factors of major importance in the
securing of reliable test scores.
(c) Other things being equal, the use of practice material in a
test before the subject comes to deal with the actual items is likely
to increase reliability. But here again a balance of factors must be
kept in mind. The American Council on Education Psychological
Examination for College Freshmen has been adversely criticized
because, in spite of its many excellent points, it comprises [...]
minutes of practice and 33 minutes of actual testing, which seems
disproportionate. Some tests, for instance the Humm-Wadsworth
Temperament Scale, actually include a great many “dead”
items, i.e., items that are not scored although the subject responds
to them just as he does to the others, partly for the sake of
maintaining a good attitude. What effect this has on reliability is not
known, so far as the writer is aware.

C. Causes of unreliability in the agent, i.e., the person who gives
or administers the test, are as follows.

(a) Inaccurate or prejudiced scoring. Obviously if the opinions 
or feelings of fatigue or boredom of the agent enter in, the relia- 
bility of the test is decreased. Most tests are set up with con- 
siderable safeguards against such eventualities. More and more of 
them are appearing with scoring stencils or machine-scoring de- 
vices. But it should be noted that the excessive mechanization of 
tests, although very convenient and tending towards the avoidance 
of certain types of variable error, tends also to restrict the range 
of items. Test items are not notable for their flexibility at the 
best; but when they are set up so that the only response the subject
needs to make is to punch a hole in the proper box with a
stylus, which is a common arrangement in machine-scored
instruments, they are still further restricted.

(b) If the agent or person who gives the test, because of lack
of understanding, or skill, or proper care, gives insufficient or
varying directions and instructions to those who are taking it, the
effect is definitely to lower the reliability of scores. Most test
manuals insist very strongly that the exact wording of the test
instructions must be followed, the exact time allotment observed,
and all details carefully handled. When it comes to individual
tests, the importance of proper administration is greatly increased.
To handle such a test properly requires very considerable thought,
study, and instruction and not a little practical experience.

(c) If the agent fails to establish effective rapport with the
persons who are taking the test, the reliability of the obtained
scores is decreased. The mood, the willingness, the seriousness, the
cooperativeness, and the good will of the subjects are factors of
primary importance. All this, again, is more prominent in individual
than in group tests. But defective rapport can make the
administration of the best group test a waste of time.

It should by now be quite apparent that the problem of reliability
is not simple. Variable errors can come from many causes
and they cannot be entirely eliminated. Also, a balance of factors
is always involved, for it is quite possible to boost reliability at
the sacrifice of validity or significance, as for instance by greatly
restricting the range of difficulty, or by reducing all items to a
dead level of uniformity for the sake of accurate and quick scoring.


2. How the reliability of a test is ascertained and recorded


As a common sense proposition there is only one way of ascertaining
the reliability of any obtained measurement, and that is
to repeat it. This, in fact, is what any worker will do, a carpenter,
for instance. If he finds that his two or three measurements of the
same piece of board coincide closely, he will conclude that he can
rely on any one of them sufficiently for practical purposes. If he
happened to be interested in the theory of measurement, he would
apply his rule many times, keep a record of the results, and see
how closely they agreed. This would enable him to calculate the
variability, or amount of error involved. If he wanted to ascertain
the reliability with which he was able to use his rule, he might
make two measurements each of perhaps twenty different jobs.
[He would then have two sets of twenty] measurements, the first
[and the second, and could compute] their correlations. This
would be a type of coefficient of reliability.

[The same plan is common]ly used in mental measurement.
[A test is given] twice to the same group. If the
[group numbers twenty, th]en there are two groups of twenty
[scores, and the correlation between the tw]o groups of scores
[is the coefficient of reliability].

There is, however, a certain persiste[nt difficulty].
With [physical measurement we can reas]onably assume that
[no change takes place in the] boards or posts in the interval
between the two applications of his rule. Moreover, the act of
measurement itself does not affect the object being measured.

Neither of these assumptions is equally safe when one is dealing
with psychological phenomena. Many changes may actually take
place in a group of human beings between two sessions devoted
to testing. Even if the time interval is short, there are such factors
as fatigue, alteration of mood, and so forth, to be considered. And
if it is long, then mental growth and specific learning can easily
affect the situation. Also, a group of human subjects will learn
something from one experience with a test. So, although they may
superficially seem the same at the time of the second testing, they
may in reality have changed very considerably, and will certainly
have changed to some extent. All this greatly complicates the
problem of ascertaining the true reliability of any test. The methods
most generally used for dealing with the matter are the
following.

A. One may repeat the same test with the same subjects. Then 
the two sets of scores are correlated. The obtained coefficient is 
often called a reliability coefficient, but it is in reality a special 
class of reliability coefficient and should more accurately be termed 
a retest coefficient. This is a very frequent practice, but it clearly 
raises some serious questions. In a short time interval there is very 
likely to be some specific practice effect carried over from the first 
to the second testing. Also, there is very likely to be some general 
orientation to the test, its items, its timing, its setup, and so forth. 
If the time interval between testings is long, the obtained correlation
will probably reflect the effect of growth, of learning, and of
environmental influences generally quite as much as it does the 
reliability of the instrument. The trouble in general is that even 
though the test itself is well constructed and avoids the more 
flagrant causes of variable error, the subjects to whom it is applied 
do not remain the same (Cronbach). 
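
The computation of a retest coefficient is nothing more than the
correlation of the two sets of scores. A minimal sketch follows.
The scores are invented for illustration and come from no actual
test, and the function name `pearson_r` is merely a label chosen
here for the ordinary product-moment correlation.

```python
from math import sqrt


def pearson_r(first, second):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(first)
    mean_1 = sum(first) / n
    mean_2 = sum(second) / n
    cross = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    ss_1 = sum((a - mean_1) ** 2 for a in first)
    ss_2 = sum((b - mean_2) ** 2 for b in second)
    return cross / sqrt(ss_1 * ss_2)


# Invented scores for eight subjects on two givings of the same test.
first_testing = [52, 61, 47, 70, 58, 66, 49, 73]
second_testing = [55, 60, 50, 74, 57, 69, 48, 71]

if __name__ == "__main__":
    print(round(pearson_r(first_testing, second_testing), 2))
```

The coefficient so obtained describes the agreement of these two
particular testings. As the text insists, it cannot by itself show
how much of any disagreement is due to the instrument and how
much to change in the subjects.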

B. Another procedure is to administer two forms of the same
test to the same group of subjects, and to express the relationship
between the scores in a correlation coefficient. Here we have another
type of reliability coefficient, sometimes called the interform
coefficient. When a well-constructed test comes in two or more
forms, they are equated for difficulty, equality being established
by the scores of a selected group or groups. Clearly the interform
procedure avoids most of the objections involved in the retest
procedure. Yet serious difficulties and reservations remain. The
equivalence of the forms may be, and indeed probably is, superficial
and imperfect (Cronbach). It is quite possible that two
forms may be of equal difficulty for one group, but not for another.
If so, the correlations will reflect this difference and will be
erroneous estimates of the reliability of the instrument. Moreover,
the experience of having taken one form is almost sure to carry
over to some extent to the taking of another; and this still further
undermines the assumed equivalence.

C. It will be noticed that both procedures just described work
in terms of actual assumed equivalence. The second giving of the
test, or the second form of the test is assumed to be equivalent to
the first; and whether or not the assumption is justified, the
presumptive equivalent test is actually given. Another approach to
the reliability problem turns upon administering only a single
testing, and proceeds in terms not of actual, but of hypothetical
“rational” equivalence, to use the expression of Kuder and
Richardson (q.v.).
The most familiar method of doing this is to run the test once,
and then split it into two halves. These two halves can be considered
as two tests taken at one sitting, and a coefficient of correlation
calculated between the two sets of obtained scores. But each of
the halves is only half as long as the original. So the obtained
correlation must be boosted by the formula given on page 47,
which is known as the Spearman-Brown Prophecy Formula. Thus,
if a test containing, in all, 150 items is treated as two tests of 75
items each, and if the correlation of obtained scores on the two
halves is .60, the formula would yield a self-correlation for the
whole test as follows:

$$ r_x = \frac{2r}{1 + r} = \frac{2 \times .60}{1 + .60} = .75 $$

where $r$ is the obtained correlation of .60, and $r_x$ the derived
coefficient. Notice that this differs in principle from the retest and
interform procedures, because two supposedly equivalent forms
are not given, the hypothesis being that if two genuinely equivalent
forms were given they would correlate as indicated.
This avoids the difficulties considered in connection with methods
A and B. But it involves a serious objection of another kind.
For there are a great many different ways of dividing a test into
two halves (Kelley, 1942; Kuder and Richardson, 1937). And
the obtained coefficient will depend upon the way the test is
divided. Split-half reliability coefficients are very often reported,
but all of them are infected with this uncertainty.
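
How far the coefficient depends upon the manner of division can be
shown on a small scale. The sketch below takes an invented table
of right (1) and wrong (0) answers for six persons on eight items,
splits it in two different ways, correlates the half scores, and
steps each correlation up by the Spearman-Brown formula for a
doubled length. The data and the function names `pearson_r` and
`split_half_reliability` are illustrative assumptions and are not
drawn from any published test.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / sqrt(ss_x * ss_y)


def split_half_reliability(item_scores, half_a_items):
    """Correlate two half scores, then step up with 2r / (1 + r).

    item_scores  -- one row per person, each a list of 0/1 item scores
    half_a_items -- set of item positions (0-based) forming one half;
                    the remaining items form the other half
    """
    half_a = [sum(s for i, s in enumerate(row) if i in half_a_items)
              for row in item_scores]
    half_b = [sum(s for i, s in enumerate(row) if i not in half_a_items)
              for row in item_scores]
    r_half = pearson_r(half_a, half_b)
    return 2.0 * r_half / (1.0 + r_half)


# Invented answers: six persons (rows) by eight items (columns).
responses = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

if __name__ == "__main__":
    odd_even = split_half_reliability(responses, {0, 2, 4, 6})
    first_last = split_half_reliability(responses, {0, 1, 2, 3})
    print(round(odd_even, 2), round(first_last, 2))
```

On these invented figures the division into odd and even items gives
a coefficient of about .87, while the division into first and second
halves gives about .72, although the answers are the same in both
cases.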


To meet this problem Kuder and Richardson (1937, 1939) have
proposed a technique for deriving a reliability coefficient from
one giving of a test, without splitting it into two parts. They
present 22 formulae for doing this, among which they specially
recommend No. 20. This formula requires nothing but the
variance of the test,* the number of items, the percentage of right
answers, and the percentage of wrong answers. It is as follows:

$$ r_{tt} = \frac{n}{n - 1} \cdot \frac{\sigma^2 - \sum pq}{\sigma^2} $$

where $\sum pq$ means the sum of the products of $p$ and $q$ for all test
items, $p$ the per cent of right answers given on a test item, $q$ the
per cent of wrong answers given on a test item, $n$ the total number
of items, and $\sigma^2$ the variance of the test. This technique,
which avoids the bias involved in splitting a test, is coming into
wide use.

* The variance is the average of the squared deviations about the central
tendency, i.e., the square of the standard deviation. [...]
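
A small worked instance may make the terms of formula No. 20
concrete. The sketch below uses an invented table of right (1) and
wrong (0) answers, expresses p and q as proportions rather than
per cents, and takes the variance of the total scores with the
number of persons as divisor. These choices, the data, and the
function name `kr20` are assumptions made for illustration.

```python
def kr20(item_scores):
    """Kuder-Richardson formula 20 for a table of 0/1 item scores.

    item_scores -- one row per person, each a list of 0/1 item scores
    """
    n_persons = len(item_scores)
    n_items = len(item_scores[0])

    # Variance of total scores (average squared deviation from the mean).
    totals = [sum(row) for row in item_scores]
    mean_total = sum(totals) / n_persons
    variance = sum((t - mean_total) ** 2 for t in totals) / n_persons

    # Sum over items of p * q, where p is the proportion answering
    # the item correctly and q = 1 - p.
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in item_scores) / n_persons
        sum_pq += p * (1.0 - p)

    return (n_items / (n_items - 1)) * (variance - sum_pq) / variance


# Invented answers: six persons (rows) by eight items (columns).
responses = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

if __name__ == "__main__":
    print(round(kr20(responses), 3))
```

Here the total scores have a variance of 6.67 and the sum of the
products pq is 1.50, so that the coefficient is (8/7) x (5.17/6.67),
or about .89. No division of the test into halves enters into the
result.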

It should be emphasized once again that the coefficients ob- 
tained by both the Spearman-Brown and Kuder-Richardson 
methods differ in principle from retest or interform coefficients. 
The latter are obtained by actual correlations of two supposedly 
equivalent testings. The former involve only a rational equivalence,
and express the internal consistency of the test (Kelley, 1942;
Jackson and Ferguson).


expressed (or, more precisely, estimated) by means of a coeffi- 
orrespondence or rela- 
btained by two givings» 
assumption of rational equivalence. 
know just how much error is in- 


$$ SE = \sigma \sqrt{1 - r^{2}} $$ *

SE is a symbol frequently but not always used for standard
error; σ is the standard deviation; r is the obtained reliability
coefficient. The practice of reporting reliability both in terms of a
coefficient and of a standard error is quite common. Its advantage
can be seen from the following hypothetical instance.

Let us suppose we have a retest coefficient of .85 and a standard
deviation of 18. Then the standard error will be 9.36, i.e.,
18 × √(1 − (.85)²). This tells us that there is a 68% chance that
a person making any given score on the first testing will score
within a range of +9.36 to −9.36 of that score on the second
testing. That is, if a person makes a score of 170 on one testing,
there is a 68% chance that his score on another testing (which is
often called his “true score,” i.e., his score on any other testing)
will fall between 161 and 179.
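
The arithmetic of this instance can be checked with a short sketch. It assumes the standard error is taken as σ√(1 − r²), the form that reproduces the figures above. The function names are hypothetical. Unrounded, the computation gives a value slightly above the 9.36 quoted, which corresponds to rounding the square root to .52. The resulting limits of 161 and 179 are unaffected.

```python
import math


def standard_error(sd, r):
    """Standard error as sd * sqrt(1 - r**2) (assumed form of the formula)."""
    return sd * math.sqrt(1.0 - r ** 2)


def score_band(score, se):
    """Range of one standard error on either side of an obtained score."""
    return score - se, score + se


se = standard_error(18, 0.85)
low, high = score_band(170, se)
print(round(se, 2), round(low), round(high))
```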


* There are various qualifications and assumptions here that are
not mentioned in the present brief exposition (v. Walker).

[...] it is quite clear that the reliability data reported
[...] and elsewhere cannot be taken at face value, but
must always be critically scrutinized. Consider, for instance, the
reliability coefficients shown in Table 5. In examining them a
number of comments are in order.


TABLE 5
REPORTED RELIABILITY COEFFICIENTS FOR SEASHORE TEST OF PITCH
DISCRIMINATION
(From Mursell, Fig. 23, p. 293)

Subjects                              Method        Coefficient of reliability
[illegible] students                  retesting     .54 ± .04
[illegible]                           retesting     .64 ± .03
[illegible]                           retesting     .90 ± .02
[illegible] school students           retesting     .73 ± .05
[illegible] students                  retesting     .68 ± .04
[illegible] college students          retesting     .88 ± .02
[illegible] college students          retesting     [illegible] ± .02
[illegible] college students          retesting     .69 ± .03
[illegible] college students          retesting     .58 ± .02
93 high school students               retesting     .71 ± .03
[illegible] music students            retesting     .76 ± .03
[illegible] music students            split half    .60 ± .03
[illegible] music students            split half    .51 ± .03
[illegible] students                  split half    .84 ± .02
328 college students                  split half    .74 ± .02
[illegible]                           [illegible]   .75 ± .03
[illegible] eighth graders            retesting     .83 ± .01
[illegible] eighth graders            retesting     .85 ± .02
[illegible] graders                   retesting     .90 ± .01
[illegible] grades                    retesting     .82 ± .01
Preadolescents                        split half    .54 ± .03
Adolescents                           split half    .40 ± .04


(a) Some of them are retest coefficients. Yet the time interval
between test and retest is not stated in the table and often does
not appear in the full and original research reports from which
the data are taken. In the case of this particular test, the time
interval may not be of prime importance, although one cannot be 
sure that it is not. Very often it is. 

(b) The wide variation among these obtained coefficients is 
instructive to any student of testing. Such variation is commonly 
found in all similar reports. It is no doubt largely due to variable 
errors of one sort or another, which, as shown, can originate from 
a large number of causes. The actual test in this case is very 
competently constructed. It consists of 100 pairs of tones differing 
in pitch, beginning with fairly large differences and going to very 
small ones. But its efficacy depends upon the newness of the 
phonograph record and also on the reproducing device and the 
acoustics of the room. Also, many people find a first 
to it difficult and annoying. Variable errors due to the instrument, 
then, cannot be ruled out. And errors originating in the agent and 
in the subjects are not only very probable but indeed certain to 
occur. 


(c) Another and purely statistical consideration must also be
borne in mind. The greater the spread of ability in a tested group,
the higher the obtained correlations will be, other things being
equal. It is almost certain that these numerous groups who were
used to calculate coefficients of reliability differed considerably in
the spread or range of their capacity to discriminate differences
in pitch. This in itself would result in different obtained
coefficients, even if all variable errors were avoided.
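
The effect of spread on an obtained coefficient can be illustrated with a small simulation. This is only a sketch under stated assumptions: ability is drawn from a normal distribution, each testing adds an independent error of the same size, and the "narrow" group is formed by keeping only persons near the middle of the ability range. All names and numerical settings are hypothetical.

```python
import random


def pearson(x, y):
    """Product-moment correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n=5000, error_sd=0.5, seed=0):
    """Retest correlation in a wide-range group and in a narrow-range subgroup."""
    rng = random.Random(seed)
    ability = [rng.gauss(0, 1) for _ in range(n)]
    first = [a + rng.gauss(0, error_sd) for a in ability]
    second = [a + rng.gauss(0, error_sd) for a in ability]

    # Narrow group: same test, same error, but a restricted range of ability.
    keep = [i for i, a in enumerate(ability) if abs(a) < 0.5]
    r_wide = pearson(first, second)
    r_narrow = pearson([first[i] for i in keep], [second[i] for i in keep])
    return r_wide, r_narrow


r_wide, r_narrow = simulate()
print(round(r_wide, 2), round(r_narrow, 2))
```

The test and its errors are identical in both groups. Only the range of ability differs, yet the coefficient obtained from the narrow group is much lower.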


A very pertinent question then arises. Which of these obtained
correlations expresses “the” reliability of the Seashore Test of
Pitch Discrimination? The answer is: None of them! There is no
such thing as “the” reliability coefficient of any test. There are
only reliability coefficients obtained with different groups under
different circumstances [...]


3. How much reliability must a test possess?


There is no one simple or standard answer to this question. 
How much reliability must physical measurements display? If a 
carpenter is making a sawhorse, considerable error is allowable.
If a cabinetmaker is constructing a fine table, the margin of error
must be much smaller. If a factory is manufacturing automobile
engines, accuracy must again be greater. If airplane motors are
being made, it must be greater still. And if a hundred-inch reflector
for an astronomical telescope is being fabricated, then the utmost
accuracy and the highest reliability that human ingenuity can
compass is barely sufficient. So in physical measurements, the
allowable degree of tolerance depends on the purpose involved. 
This is also the case with mental measurements. 

It is often said that if a reliability coefficient is calculated for a
group with the range of one school grade it must be at least .50
in order to discriminate between two group means with sufficient
certainty so that there is a five-to-one chance of the difference’s
being real. For individual classification, however, a test is said to
require a reliability of at least .94 when calculated under the same
conditions. These statements were made by Kelley more than
twenty years ago (q.v., 1927, pp. 210-11), and have been quoted
many times since. One formidable implication is that few
published tests are adequate for individual diagnosis, and that they
can at best be properly used only for rough group differentiation,
for few of them have reliabilities of .94 and over, computed on
a range of one school grade.

The reply is, however, that while the above statements are
beyond question correct on the assumptions upon which they were
based, they do not take into consideration the actual uses to
which tests may be put (Guilford, 1946). We have already seen
that if a test is to be used for very exacting selection, e.g., to
screen out all but ten, or twenty, or thirty per cent of prospects,
it can have a relatively low validity and reliability and still be
serviceable and dependable. Also a test may make a unique
contribution, and so belong in a battery in spite of low reliability and
validity. Thus a short test requiring judgments about the lengths
of lines was set up in connection with airplane pilot selection
during the war. It had a reliability of only .25, but because it
brought out a unique factor, it belonged as a proper element in
the selection procedures.

Kuhlmann (1939) has raised a question of fundamental
significance in this whole connection. He contends that tests ought
not to be constructed in such a way as to yield uniformly high 
reliabilities. His argument is that there are many variable con- 
ditions which are in fact important and relevant and that a sensi- 
tive instrument ought to register them. For example, he says a 
headache actually though temporarily affects mental ability, and 
a test so constructed and administered that it overrides and 
ignores the influence of the headache is false to the real facts of 
the situation. Many of the factors which we have called variable 
errors, particularly those originating in the subject, Kuhlmann
would regard as perfectly legitimate variations in the facts
themselves. Carried to an extreme, this argument would render all
testing and indeed all psychological experimentation futile because
it would be overwhelmed by uncontrolled variable influences of
many kinds. What Kuhlmann really seems to have in mind is that
a test should be considered as a standard situation which a
clinician or guidance counselor uses to help him in observing a
subject’s reactions. The score itself would presumably be of much
less importance than the conclusions which the expert would draw
in studying the performance of the person to whom he was
administering the test. To understand this position, it must be
remembered that Kuhlmann has advanced it in connection with his Tests
of Mental Development, which is an individual instrument. The
argument does not apply to group testing. It must presumably be
interpreted as a claim that an individual test should be used
as an instrument of diagnosis rather than of measurement,
perhaps as primarily projective and only in a secondary
sense [...].

[...] apparent that the issue of test reliability
is less conclusive than many people have been led to
suppose. The upshot clearly is caution. Psychological tests are very
far indeed from being automatically usable instruments with a
reliability which invariably appears in the scores obtained.

No one set of test data can be considered final. So-called snapshot
testing, in which instruments of measurement are applied once to
groups of subjects and all sorts of long-range conclusions
confidently drawn, should certainly be avoided. The total personal
situation of the subjects must always be taken into account. But
when all this is said, the fact remains that our existing
psychometric tests, in spite of their limitations and uncertainties, have
proved exceedingly valuable and constructive instruments, and
that their value can be enhanced by more discriminating and
well-instructed use and interpretation.


OBJECTIVITY 


By the objectivity of a measuring instrument is meant its free- 
dom from errors due to personal feeling and bias, its ability to 
reflect the facts of the situation irrespective of the “personal
equation” of the agent. The usual way of explaining this characteristic
is to say that a test is objective to the degree that a number of
different persons applying it to the same phenomena and scoring
it will come out with results which agree.

Most psychometric tests achieve a high nominal objectivity by
quite simple means. The items of which they are constructed
permit only a single “right” response. These responses are scored
by a key, or a stencil, or a machine. Sheer mistakes are of course
always possible, but personal opinion and bias are eliminated. In
some individual tests, to be sure, the judgment of the agent is
involved. Thus the subject’s response to some of the Stanford-Binet
items, e.g., the vocabulary items, can be quite varied, and
the person who is giving the test has to decide how good they
are. Even so, the manual contains a wealth of very specific
instructions and limits and particularizes judgment, keeping it within a
narrow channel.

A high degree of objectivity is usually regarded as very
desirable and as relatively easy to achieve, but there are quite a
number of qualifications to consider before accepting this view out of
hand.


1. When high objectivity is obtained simply by using items with
only one allowable and specified response, what is gained in one
way may easily be lost in another. For instance, in a test which
was given to a group of Indian children, the following item
occurred: “Crowd—closeness, danger, dust, excitement, number.”
Two of the last five words were to be underlined, on the basis of
their congruence with the stimulus word crowd. The tendency of
the Indian children was to underline “dust,” “excitement,” and
“danger.” In view of their special background and experience these
were perfectly intelligent responses. But they did not agree with
the scoring key of the test (Fitzgerald and Ludeman). Or again,
in the Goodenough “Draw a Man” test of intelligence, the child
is rated on the lines he puts in to represent the various parts of
human anatomy and similar representational factors. But a
worker using such responses as projective indications might feel
that they suggested all sorts of interpretations as to the types of
personality, the emotional organization, and the temperaments of
the subjects and might object to limiting them as the test requires.
In dealing with true-false items again, only the standard “correct”
responses receive credit. But if one could know the mental proc- 
esses by which unacceptable answers were arrived at, they might 
be considered indications of high intelligence and judiciousness. 
To put the matter in general terms, the objectivity actually mani- 
fested by a great many tests is arbitrary rather than real and to 
that extent spurious. It is something forced upon the subject’s 
responses by the mechanism of the scoring. 

2. In certain kinds of test situations this usual resort to defi- 
niteness and to scoring on predetermined “right” responses is not 
possible, and the agent must make some sort of evaluation of the 
subject’s response. The ordinary way of determining whether such 
evaluative judgments are objective, and to what extent, is to have 
a number of judges evaluate the same Phenomena and then ascer- 
tain the extent of agreement. But there is a superior method. 
Objectivity is established if a number of evaluations made by a
single rater correlate with one another as well as but no better
than evaluations made by a group of persons rating. Thus in one
[...] judge agreed with himself to a higher
[...] the intragroup agreement, which
[...] evidence of bias. This technique has been little
used, but it offers a superior method of dealing with objectivity
(H. F. Adams).


3. This leads directly to another [...]
mental issue. A test may be [...]
a far-reaching subjectivity. Most intelligence tests correlate with
themselves considerably higher than with other intelligence tests.
That is to say, their retest coefficients are higher than their
correlations with other tests purporting to measure the same mental
characteristic.* On the criterion explained above, this indicates
[...] of bias or subjectivity, or of an arbitrary element in
the test. What this element is seems easy enough to indicate. The
[...] around which the test is built is never perfectly
[...], nor is it translated into test items with perfect
[...]. The test rates fairly well on what its author considers
[...] and on the items into which he has managed to translate
his operating conception, but it rates less well on a common,
[...] opinion as to the nature of intelligence. The same is
[...] more true of aptitude tests. Thus the Drake Test of Musical
Memory and the Seashore Measures of Musical Talent both purport
to yield scores indicating musical ability. But the two authors
conceive of the function very differently. The same could easily
be said of different tests of mechanical aptitude. If we propose
to measure the width of a door, we are not dealing with any one
individual’s special view about the nature of width, but with a
universal, i.e., an objective consensus. In tests of mental processes,
however, we are invariably dealing with some individual conception
of what is to be measured. And although that opinion may be
well informed, or well in line with commonly accepted views, or
backed up by reasonably convincing evidence, it is sure to have
[...] of subjectivity.

Thus the problem of objectivity, like those of validity and
reliability, is not one that has been conclusively settled. No tests
[...] approximate to perfect validity, reliability, and objectivity in
all senses. Still, their practical value indicates a theoretical
authenticity that goes about as far as it can in the present status of
psychological science.

* The reader is referred back to Table 4 for typical intercorrelations among
intelligence tests.


STANDARDIZATION 


By the standardization of a test is meant the establishment of
norms for the interpretation of the results it yields. This is partly
a matter of the organization of the test itself and partly of
interpreting the scores obtained when it is applied. These two
considerations are interrelated because the scores which result when
the test is given depend upon its organization. If it is an age
scale like the Binet scale of 1908 (see Figure 4) and numerous
other instruments following the same general pattern, the subtests
are grouped at a series of age levels, and the test immediately
yields a mental age score. This, of course, means that a process of
standardization has already been accomplished. If it is set up like
[...]

is that this latter group is representative, or a true sample, of
human beings with respect to intelligence. To return to the illus- 
tration afforded by our hypothetical table of heights, we might 
examine a distribution of the heights of several thousand human 
beings, ascertain how the measured heights of our 54 cases com- 
pared with these, and then conclude that they would compare in 


TABLE 7

LETTER GRADE VALUES ON ARMY GROUP INTELLIGENCE
EXAMINATION ALPHA

(From Yerkes, 1927, p. 422)

Letter grade                  |    A    |    B    |   C+   |   C   |  C−   |   D   |  D−
Limit of scores               | 212-135 | 134-105 | 104-75 | 74-45 | 44-25 | 24-15 | 14-0
Per cent receiving each grade |  5.14   |  9.83   | 18.30  | 28.66 | 21.5  | 9.19  | 7.38


the same way with the heights of all human beings. The inference 
might be defensible, but it would be an inference just the same, 
and should be understood as such. 
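
The letter-grade classification of Table 7 amounts to a simple lookup, which can be sketched as follows. The lower limits are read from the table, with the seventh grade taken to be D−. The names `ALPHA_GRADE_LIMITS` and `alpha_letter_grade` are hypothetical.

```python
# Lower score limit for each letter grade, read from Table 7
# (highest grade first). The last grade is assumed to be "D-".
ALPHA_GRADE_LIMITS = [
    ("A", 135),
    ("B", 105),
    ("C+", 75),
    ("C", 45),
    ("C-", 25),
    ("D", 15),
    ("D-", 0),
]


def alpha_letter_grade(raw_score):
    """Return the letter grade whose score limits contain raw_score."""
    if not 0 <= raw_score <= 212:
        raise ValueError("raw scores in Table 7 run from 0 to 212")
    for grade, lower_limit in ALPHA_GRADE_LIMITS:
        if raw_score >= lower_limit:
            return grade


print(alpha_letter_grade(140), alpha_letter_grade(80), alpha_letter_grade(10))
```

The lookup itself involves no inference. The inference lies in treating the percentages of the original group as if they held for everyone to whom the grades are later applied.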


Exactly the same logic is employed in standardizing and
interpreting mental tests. The test is tried [...]
or amounts of the function or trait as
they appear in that population.

[...] raw Alpha scores just described is probably
the simplest of all methods of classification. It does not even
[...] and could postpone final
interpretation until major returns were in.

In the construction of most tests this is not feasible. So a special
standardization group has to be set up, usually at considerable
[...] trouble. Nor are the raw scores of the standardization
group [...] thrown into an approximately normal distribution
[...] ratings on what seems a reasonable basis.
Instead one or more of a number of possible interpretive
procedures is adopted, the most typical illustrated in Tables 8
and 25.

First, both tables show the interpretation of raw scores
by formulating their equivalent percentile scores. Turning first to
the data on the O’Rourke Mechanical Aptitude Test in Table 25,
the centile (or percentile) equivalent of a raw score of 317 is 97.7.
This means that a person who scores 317 does better than 97.7%
of the standardization group. So on for the other percentile scores
listed. [...] with the Otis Self-Administering Tests of
Mental Ability, the scoring standards of which are shown in
Table 8, percentile equivalents are worked out for three different


TABLE 8

PERCENTILE, STANDARD DEVIATION, AND MENTAL AGE NORMS FOR OTIS
SELF-ADMINISTERING TEST OF MENTAL ABILITY, HIGHER
EXAMINATION

(Adapted from Bingham)


Raw         | Standard | Stanford-  | General population, | High school | College
scores      | scores   | Binet M.A. | C.A. 18             | seniors     | students
72          | 7.5      | 19-3       | 99.4                | 99          | 98
[illegible] | 7.0      | 18-6       | 97.7                | 96          | 87
[illegible] | 6.5      | 17-9       | 93.3                | 88          | 69
[illegible] | 6.0      | 17-6       | 84.1                | 68          | 52
[illegible] | 5.5      | 16-3       | 69.1                | [illegible] | [illegible]
[illegible] | 5.0      | 15-4       | 50.0                | [illegible] | [illegible]
[illegible] | 4.5      | 14-4       | 30.9                | 13          | 6
[illegible] | 4.0      | 13-3       | 15.9                | [illegible] | [illegible]
[illegible] | 3.0      | 11-0       | 2.3                 | 0.3         | 0
[illegible] | 2.5      | 10-0       | 0.6                 | 0           | [illegible]

groups. A score of 72 on the test is better than that of 99.4% of
an unselected group of persons 18 years old, better than that of 
99% of high school seniors, and better than that of 98% of college 
students. It should be noted that these equivalents were developed 
with specific groups of given composition, and that they are by 
inference generalized to apply to all unselected eighteen year olds, 
all high school seniors, and all college students. 
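
A percentile equivalent of this kind can be sketched in a few lines. The sketch assumes the simplest definition, the percentage of the standardization group scoring below the given raw score. Published tables may treat tied scores differently. The function name and the ten scores are hypothetical.

```python
def percentile_equivalent(raw_score, standardization_scores):
    """Per cent of the standardization group scoring below raw_score."""
    below = sum(1 for s in standardization_scores if s < raw_score)
    return 100.0 * below / len(standardization_scores)


# Hypothetical standardization group of ten persons.
group = [12, 15, 18, 18, 21, 24, 27, 30, 33, 36]
print(percentile_equivalent(30, group))
```

The same raw score yields a different percentile equivalent for each standardization group, which is why Table 8 gives three separate columns.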

Second, both tables show the interpretation of raw scores by
formulating their equivalent standard scores. This is a different
basis of classification. The standard score is based on the measure
of dispersion known as the standard deviation. It is graphically
illustrated in Figure 29A. The standard deviation is one of the
measures of the scatter or dispersion of a group about its central
tendency. It is calculated by adding the squares of the deviations
of all the scores from their central tendency (usually the “average”
or arithmetic mean), dividing by the number of scores, and taking
the square root of the result. If, as in Figure 29A and B, the scores
are distributed along the base lines, any given score can be
expressed in terms of its distance in a positive or negative direction
from the central tendency or central score in standard deviation
units. When this is done, the standard score equivalent of the given
raw score is obtained. A raw score, however, may deviate from the
central tendency in two directions, positive and negative. But to
have both positive and negative scores is often inconvenient and
can be misleading, so the common practice is to multiply the
obtained deviation (in standard deviation units) by 10 and add 50.
This has been done in the standard scores reported in the two
tables, and one further step has been added, for the result has been
divided by 10. An illustration will serve to make the whole
procedure clear. The Otis raw score 72 has an equivalent standard
score of 7.5. This has been obtained by multiplying its deviation,
in standard deviation units of the standardization group, by 10,
adding 50, and dividing by 10. Let us follow these steps backward:
7.5 × 10 = 75; 75 − 50 = 25; 25 ÷ 10 = 2.5. So the raw score
of 72 will indicate a performance 2.5 standard deviation steps
above the central tendency of the standardization group. Turning
again to Figure 29, it is possible to mark off on the base line just
where this point is located. Notice that like a percentile score, it
indicates a point below which a certain proportion of the cases in
the distribution or group fall; and what that proportion is can be
calculated if the form of the distribution is known. With a normal
distribution, shown in Figure 29A, a score of 2.5 standard
deviations above the mean will be better than about 99% of the cases.
With the distribution shown in Figure 29B, it is better than any
obtained score, for the best performance is only about 1.25
standard deviations above the mean.
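
The conversion just described, and the step from a standard score to a proportion of cases, can be sketched as follows. The mean of 42 and standard deviation of 12 are hypothetical values chosen so that a raw score of 72 lies 2.5 standard deviations above the mean. They are not the Otis norms. The proportion is computed on the assumption of a normal distribution.

```python
from statistics import NormalDist


def standard_score(raw, mean, sd):
    """Deviation in SD units, times 10, plus 50, divided by 10."""
    z = (raw - mean) / sd
    return (z * 10 + 50) / 10


def deviation_from_standard_score(score):
    """Follow the same steps backward to recover the deviation in SD units."""
    return (score * 10 - 50) / 10


# Hypothetical group: mean 42, standard deviation 12.
print(standard_score(72, 42, 12))

z = deviation_from_standard_score(7.5)
share_below = NormalDist().cdf(z)  # valid only for a normal distribution
print(z, round(100 * share_below, 1))
```

The last step depends entirely on the form of the distribution. With a skewed distribution the same standard score would correspond to a different proportion of cases.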

Clearly, then, the use of standard score equivalents, like the
use of percentile equivalents, is a method of interpreting raw scores
by classifying them on the basis of the performance of the
standardization group. When a given raw score is equated to a certain
standard score, what is really said is that a certain proportion of
the standardization group exceeds or falls below that raw score.
And once more the assumption is made that the standardization
group is representative, so the standard score is taken to mean
that a certain proportion of all persons to whom the test applies
falls below the raw score of which it is the equivalent.

Table 8 illustrates yet another procedure for the interpretation
of raw scores, i.e., their classification on a basis of mental age
values. An Otis raw score of 24 is equivalent to a Stanford-Binet
mental age of 12-2 (twelve years and two months). Now the two
Stanford Revisions of the Binet Scale are both age scales. That is,
the subtests of which they consist are assigned to certain age
levels. The statement of equivalence in Table 8, then, means that
a person who is able to make a score of 24 on the Otis test will be
able to pass the Stanford-Binet tests up to the level of 12-2. This,
of course, was ascertained by giving both the Otis test and the
Stanford-Binet test to at least a part of the standardization group.
And again the meaning of the statement is generalized by
implication, on the assumption that the standardization group is
representative.

It would, of course, be perfectly possible to work out mental
age equivalents for the Otis scores directly, without any reference
to the Binet scale. These would be either the mean scores for each
successive age level, or the mean ages at which each score was
attained. This, in fact, has been done in connection with a great
many tests. The placement of the Stanford-Binet tests at certain
age levels was accomplished by a variant of this procedure, as we
shall see. The subtests were grouped in accordance with the
performance of persons at various age levels in the standardization
group. So the assignment of mental age values to test performance
is yet another example of interpretation by classification. The
classification is made expressly in terms of the performance of the
standardization group. This group is considered representative or
a true sample. And the interpretive classifications are generalized
to apply to any persons who may take the test.
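
The first of the two direct methods, the mean score at each successive age level, can be sketched as follows. The records are hypothetical, and assigning a raw score to the age level with the nearest mean is a simplifying assumption made for illustration. It is not the procedure of any particular test.

```python
from collections import defaultdict


def mean_score_by_age(records):
    """Mean raw score at each age level; records are (age, score) pairs."""
    by_age = defaultdict(list)
    for age, score in records:
        by_age[age].append(score)
    return {age: sum(v) / len(v) for age, v in sorted(by_age.items())}


def mental_age(raw_score, norms):
    """Age level whose mean score lies closest to raw_score (an assumption)."""
    return min(norms, key=lambda age: abs(norms[age] - raw_score))


# Hypothetical standardization records: (age in years, raw score).
records = [(10, 14), (10, 18), (11, 20), (11, 24), (12, 26), (12, 30)]
norms = mean_score_by_age(records)
print(norms, mental_age(23, norms))
```

Whatever the details, the norms are nothing more than a summary of how this particular group performed at each age.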

At a later point in this book, and after dealing with the 
numerous and far-reaching questions involved, we shall return to 
a more detailed analysis of these various types of scores and to 
the whole problem of standardization. For the time being, how- 
ever, the important point is to understand its technique and its 
inferential character. The uninterpreted raw scores yielded by any 
test have very little meaning. All that a tabulation of such scores 
can show is what the particular subjects have actually done. In 
order to evaluate this showing as an indication of intelligence, Or 
mechanical aptitude, or musical ability, or what not, the scores 
must be projected against a background of the general
performance of human beings. Since, however, there is no possibility of
administering a test to all human beings, or all Whites, or all
Negroes, or all residents of the United States, or even to a very
large segment thereof, it becomes necessary to choose a
representative sample and base conclusions upon it. There is, perhaps, not
much need to point out that such a procedure is full of pitfalls.
If the standardization group is not representative (if it is
unusually able or unusually weak or if it is infected by some social
or economic bias), standardization fails, and the derived norms,
i.e., the classifications of the raw scores, are misleading. Indeed
it has been shown that conclusions drawn from some important
pieces of psychological research involving measurement may very
well reflect little more than the make-up of the sample of human
beings used as subjects (Marks). Yet this is the virtually
universal practice in the standardization of mental tests.


CONCLUSIONS


This scrutiny of the basic logic of psychological testing is of
great importance, for it is always necessary to understand just
what our tests imply and involve if they are to be handled
properly. [...] emerge. These [...]
mental tests.

1. A given test can only be valid for the population represented
by its standardization group. Even for this population it can be
valid only if its standardization group is a true sample. For
instance, a revision of the Binet scale made and standardized in
England, using English children as the standardization group, was
applied to Kaffir children. It was found very misleading. Among
the subtests are some that call for decisions about dates, about
coins, and about colors. But the Kaffir children had no experience
with either dates or coins, their language contains no word for
yellow, and they have difficulty in discriminating green from blue
(Martin). The question of a universal test of mentality, applicable
to all human beings everywhere, has been mooted, but nothing
approaching it has ever been achieved because of the enormous
differences among different racial, national, and other groups of
men (Schieffelin and Schwesinger).

2. It is always a matter of doubt whether two tests which
nominally deal with the same function (intelligence, mechanical
ability, artistic ability) are strictly comparable. Partly, this is
because their authors may use different statistical procedures in
computing the interpretive norms. But the question chiefly arises
because the two tests are constructed with reference to different
standardization groups.

3. The basic orienting concepts on which tests are built are
never perfectly clear, explicit, and unambiguous. This, however,
is not a criticism of psychometrics alone. It is due to the present
status of psychological science. As a better understanding of the
organization and operation of the human mind emerges, it will be
possible to construct better tests.

4. The work of translating the orienting concepts into
instruments of measurement as valid, reliable, objective, and clearly
interpreted as possible is subject to many limitations and much
doubt. It is, however, essential if there are to be instruments of
mental measurement at all; and rough and crude though it may
be, it has proved successful in considerable measure.

5. It is always dangerous to assume that a mental test can reveal
or measure intelligence, or aptitude, or talent, or attitude, or
personality type in general, or can uncover its universal essence.
It can only reveal and deal with any such function or trait in the
setting of a particular population, though perhaps a very large one.
Presumably, intelligence has certain fundamental aspects that are
identical in its manifestations among Whites, and Negroes, and
Chinese, and Russians, and Australian Aborigines. So one may
assume that there are at least elements of universality in any
well-constructed test of a mental trait or characteristic or function.
But to carry the processes of generalization too far is
risky, and the risk is greatly increased if one is not aware of what
one is doing.

QUESTIONS FOR DISCUSSION


1. Examine the manuals of several representative tests of various
types listed in Bibliography II at the close of the book and report
on the methods used in securing and ascertaining validity.

2. How clearly and adequately do the authors of the above tests
define their working concepts? How successfully do they seem to
select test items to implement them?

3. Suggest any steps that might be taken in the administration of
a test to increase its reliability.

4. Which of the causes of unreliability mentioned by Symonds
seem to you to be more or less under the control of the person
administering any test?

5. How is the validity of a test affected (a) by its reliability, (b)
by its objectivity, (c) by its standardization?

6. What distinction might be drawn between the objectivity of
the items of a test and of the test itself?

7. What would be the effect of using the 54 Columbia students
whose scores are tabulated in Table 6 as the standardization group
for Army Alpha?

8. Suggest some measures that might be taken to secure a
standardization group for any test that would be really representative.

9. Discuss and consider the implications of Kuhlmann’s claim that
high reliability is not always desirable in a mental test.

10. Report and discuss any statements you may have heard that
seem to imply a belief that tests deal with mental functions “as
such” irrespective of their manifestation in some specific group or
population.


CHAPTER III


THE CONCEPT OF GENERAL INTELLIGENCE 
THE CONCEPT: ITS IMPORTANCE AND BACKGROUND


The whole modern testing movement has centered about the rise
of a working concept of general intelligence. Pragmatically it still
remains the most successful and fruitful of all the operating
hypotheses that have emerged, both in its effects in the way of
test construction and in the results that have accrued when tests
have been applied to the solution of human problems. By and
large, tests of general intelligence are still the best we have.
Psychometric instruments intended to measure other mental functions
have been modeled upon them, both in form and in method of
construction, and sometimes they have succeeded and sometimes
they have failed. One of the most important psychometric
developments in recent years has been the emergence of the doctrine
that mental organization depends upon an array of separable and
definable factors rather than upon a unitary general intelligence.
Just how far the theory of factors is a departure from earlier
views it is not yet possible to determine decisively, for the
question is still controversial, but probably it can be regarded as an
extension, or elaboration, or redefinition rather than as a
completely opposing point of view. An increasing number of excellent
tests are being constructed on the basis of factor theory, but
whether they will prove decisively superior to the earlier tests of
general intelligence, which are still in very wide use, it is too soon
to say.

The present chapter will be devoted to a consideration of
attempts to define, describe, characterize, and delimit this central
conception. In the following three chapters an account will be
given of the typical instruments of measurement in which it has
been embodied. Following this there will be an account of tests
of the same general model, but dealing with other mental
functions and centering about different but not opposing concepts.
This in turn will lead to a consideration of the present status of
the numerous issues raised.

The concept itself needs to be understood against the background
of its emergence and development. It has figured as an
element in a type of psychological thinking that has become
prominent during a period roughly coinciding with the present
century. Earlier psychology was much influenced by the widely
accepted hypothesis of mental faculties, which were thought of as
more or less independent capacities, such as reasoning, memory,
imagination, emotion, musicality, philoprogenitiveness, and the
like. The tendency was to try to explain any kind of mental
performance in terms of the isolated action of the appropriate faculty.
The faculties were often referred to as “mental muscles,” and
there was a current belief that they could be strengthened or
“trained” by appropriate exercises. It was assumed that the ability
or potentiality of any person was due to what would now be called
the profile of his faculties. This view was the logical basis of the
practice of phrenology, which, however, involved the further
assumption that the faculties of the mind could be determined by
the contour of the skull. Phrenology never had full status in the
best psychological thinking, but it was accepted enthusiastically
by many very able and serious men, for instance Horace Mann.
Neither faculty psychology nor phrenology directly influenced
the development of psychological testing, which was carried to
considerable lengths during the nineteenth century, but they
affected it indirectly. Tests were constructed to measure very
specific and limited functions, many of them psychophysical, such as
auditory acuity, attention span, and voluntary attention as measured
by the cancellation of certain letters in printed material, among
other means. Such tests might very readily be objective, reliable,
and in a narrow sense valid, for they measured what they were
intended to measure, but they lacked any broad significance or
utility. It was found that they had, for instance, very little
relationship to success in school, or to any of the higher and more
complex and important manifestations of mentality. It cannot be
said that they were derived directly from faculty psychology, for
they did not undertake to measure these alleged independent mental
entities. But so long as this view prevailed, it was a major
obstacle to the development of a different and better type of
psychometric instrument.

The central element in the change that has taken place is the
insistence that the mind or personality always acts as a whole, and
not in segments, and that investigation must always take account
of its total action if any real understanding is to be achieved. Such
terms as memory, reasoning, imagination, feeling, and so forth,
are of course retained. But they are thought of as functions of the
entire personality, ways in which the mind as a whole deals with
situations, and not as separate subdivisions or unitary capacities.
One consequence has been the abandonment of the doctrine of
mental training, or formal discipline, which assumed the
possibility of strengthening separate faculties by exercises in which the
content and setting were matters of indifference. And long before
this viewpoint received its characteristic present-day expression in
the work of configurationalist, or holistic, or Gestalt psychologists,
it led to the rise of a different type of mental tests.

This evolution of psychological thought and practice, in so far
as it affected mental measurement, has been associated with the
name of Alfred Binet. As early as 1895 he became interested in the
development of a type of test very different from the then current
measuring devices which dealt with narrow and highly specialized
functions. He worked out tests for general memory, mental
imagery, and comprehension. But as yet his guiding and operating
concepts were very vague. He was still feeling his way, and these
early tests were largely fruitless.

The turning point in his career came when he was commissioned
to investigate the capacities and possibilities of children in the
Paris schools, particularly those of low mentality, and to devise
means for differentiating at an early age those who had
educational promise from those who lacked it. It is very notable and
significant that the modern testing movement has its source in a
practical problem. Binet’s first solution to the problem which was
set for him was his earliest more or less complete set of tests,
which are shown in schematic outline in Figure 2.

The 1905 tests which Binet published and used contain many of
the features of present-day scales. Some of them are holdovers
from an earlier day, such as the comparison of lengths of lines and
the comparison of weights (numbers 21 and 22). But the greater
emphasis is upon true intellectual tasks, instances being the
unwrapping of candy, the execution of commands, naming of
common objects, sentence completion, paper folding, defining abstract
terms, and numerous others. These tests, as will be seen, were not
formally arranged as an age scale, but Binet was already aware
of the significance of age differences and their relationship to
mentality. He pointed out that children of the same age but of
different mental ability will pass different numbers of tests. He
claimed that imbeciles could pass those from 7 to 15, and that the
various higher levels of mentality would show differential success
with the rest. And above all, in spite of the inconsistencies noted
above, which were eliminated from his later work, it is clear that
he had in mind a comprehensive survey of the individual mentality
which had little to do with psychophysical processes and
which was in effect a break with the assumption of separate
mental faculties. It is very noteworthy that this extraordinarily
pregnant idea was put into operation almost casually, without any
theoretical or doctrinal trimmings, as the obvious answer to a
practical challenge.

The concept of general intelligence, then, stands for two things.
First, it stands for a way of appraising and evaluating people. A
human being is not to be understood as a composite of special
faculties. If he is to be appraised at all, if his promise and
limitations are to be assessed, this should be done in terms of a general
over-all evaluation of something

 1. Ability to follow a moving object with the eyes.
 2. Grasping a piece of wood brought in contact with the hand.
 3. Showing a piece of wood to see if child will grasp it.
 4. Choosing between piece of chocolate and piece of wood.
 5. Candy wrapped in paper presented to see if unwrapped.
 6. Execution of simple commands and imitation of simple gestures.
 7. Knowing names of parts of body and simple objects.
 8. Indicating things in pictures in answer to questions.
 9. Naming common objects in picture.
10. Telling which of two lines is longer.
11. Repetition of three digits.
12. Telling which of two weights is heavier.
13. Asking for objects not present, for things in picture by nonsense
    word, comparison of three unequal lines, then three equal ones
    (suggestibility).
14. Definition of objects.
15. Repetition of sentences.
16. Indicating differences between pairs of objects.
17. Exposure of pictures of familiar objects, after which child
    recalls as many as possible.
18. Drawing designs from memory after ten seconds’ exposure.
19. Repetition of digits.
20. Indicating resemblance between pairs of objects.
21. Comparison of lengths of lines.
22. Comparison of weights.
23. Memory for weights shown by remembering which of weights
    placed in order is missing when one has been removed.
24. Finding rhymes to given words.
25. Completion of sentences.
26. Making sentence including three given words.
27. Comprehending questions graded from easy to hard.
28. Reversing clock hands from memory.
29. Cutting a triangular piece from paper folded twice, task being
    to tell what it will look like unfolded.
30. Defining abstract terms.

FIG. 2. THE FIRST BINET TESTS (1905)
(After Pintner, 1931, pp. 136-37)

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Kindergarten, also with very happy results (Webb and Shotwell). 
Could these wise and successful adjustments have been made if
there had been an attempt to estimate the patterns of faculties
manifested by these two children? Almost certainly not. Would
tests of visual acuity, auditory acuity, attention span, differential
threshold for weight, and the like have furnished a proper basis
for decision? Again the answer is no. But a general over-all survey
of mentality provided what was necessary. Is it to be anticipated
that some day there may be tests not based on general intelligence
as now understood, which would tell the story better and reveal
the facts more discriminatingly? This is entirely possible. Faculty
psychology provided a poor set of working conceptual tools. Our
present holistic psychology provides a much better set. Still better
conceptual instruments will very probably be developed in the
future, such indeed being precisely the hope and intention of our
present analytic techniques. This is the only essential point either
for scientific or practical purposes, and the “reality” of the
concepts is a question which need not arise at all.


DEFINITIONS OF GENERAL INTELLIGENCE 


Many writers on the subject of general intelligence have felt
obligated to put forward compact and more or less formal
definitions of it. This, no doubt, is a praiseworthy endeavor and an
important one. Although the results seem somewhat confusing at
first sight, valuable insights can nevertheless be gained from them.
It is certainly not necessary here to attempt anything like a
comprehensive catalog of such definitions, but a reasonable number
of samples is well worth considering. Here are a few such with
their chronological arrangement indicated.

“Intelligence is a general capacity of the individual consciously
to adjust his thinking to new requirements” (Stern, 1914). “Any
sort of attentive memorial or perceptive activity is at the same
time an intelligent activity just in so far as it includes a new
adjustment to new demands” (same author). “Intelligence means
the property of so recombining our behavior patterns as to act
better in novel situations” (Wells, 1917). “Intelligence seems
to be a biological mechanism by which the effects of a complexity
of stimuli are brought together and given a somewhat unified
effect in behavior” (Peterson, “Intelligence and Its Measurement,”
1921). To be intelligent a test subject “has to see the point of the
problem now set him, and to adapt what he has learned to this
novel situation” (Woodworth, “Intelligence and Its Measurement,”
1921). “Intelligence is the ability to learn” (Buckingham,
“Intelligence and Its Measurement,” 1921). “An individual
possesses intelligence in so far as he has learned or can learn to adjust
himself to his environment” (Colvin, “Intelligence and Its
Measurement,” 1921). “An individual is intelligent in proportion as he
is able to carry on abstract thinking” (Terman, “Intelligence and
Its Measurement,” 1921). “We may then define intellect in
general as the power of good responses from the point of view of
truth or fact” (Thorndike, “Intelligence and Its Measurement,”
1921). Thurstone (1923) characterizes intelligence as a movement
from trial and error towards increasingly abstract controls.
“Intelligence may be regarded as the capacity for successful
adjustment by means of those traits which we ordinarily call intellectual.
These traits involve such capacities as quickness of learning,
quickness of apprehension, the ability to solve new problems, and
the ability to perform tasks generally recognized as presenting
intellectual difficulty because they involve ingenuity, originality, the
grasp of complicated relationships, or the recognition of remote
associations” (F. N. Freeman, 1925). “Intelligence is the ability
to learn actions or to perform new actions that are functionally
useful” (F. N. Freeman, 1940).

From this brief survey of typical attempts to define
intelligence several significant points arise.

1. The prevailing impression is not one of contradiction, but of
vagueness. It is true that there are differences in emphasis, but
on the whole it would not seem very difficult to reconcile the
various formulations. Indeed one might not unjustifiably suppose
that the authors represented would, on discussion, find themselves
in substantial though no doubt not entire agreement. It would seem
that they are dealing with a conception that is essentially hazy,
although no doubt meaningful, and that cannot be pinned down
within the scope of a compact statement.

2. In order to bring out the common meanings and
divergencies among these definitions, at least two important
writers have tried to assign them to a scheme of classification.
Pintner (1931, pp. 47-51) groups them as biological, educational,
faculty, and empirical. Again F. N. Freeman (1939, 1940) classifies
them as organic, i.e., those which characterize intelligence as a
characteristic of organic constitution; social, i.e., those which
emphasize its dependence upon symbols and cultural concepts; and
behavioristic, i.e., those which define it in terms of performance
on a given test. Unfortunately, neither scheme of classification
seems very enlightening or very successful in bringing a tidy order
out of a baffling array of formulations. The various groupings
overlap, and it is often hard to say just where any given definition
should be finally assigned. What Pintner and Freeman have
actually succeeded in doing is to play up the indubitable truth that
general intelligence is a vague concept and that it may properly
be considered from many different though by no means conflicting
viewpoints.

3. The chronological pattern of these definitions, which was
emphasized in presenting them, is perhaps more revealing than
any attempt at a topical classification. As will be seen, they
extend all the way from 1914 to 1940. The impressive thing that
emerges from a chronological survey extending over twenty-six
years is the repeated appearance of the same points, with only
minor additions and shifts of emphasis. To refer to a specific case,
Freeman’s formulation in 1940 is substantially the same as the
one he offered fourteen years earlier, except that in the former he
mentions functional usefulness. Presumably this would mean that
intelligence manifests itself more adequately in administrative
decisions or scientific research than in chess or crossword puzzles.
But even this is not certain, because functional usefulness can take
in a great deal of territory.

4. There is only one type of definition of general intelligence
to which express exception must be taken, and no instance of it
occurs in those cited above. This is any definition or description
which claims that intelligence is hereditary. Boynton (q.v.), for
example, begins his fourfold description by stating that
“intelligence is an hereditary capacity.” The objection to this is not
that the proposition is false, which may or may not be the case,
but that it prejudges issues which can be settled only by
investigation, if they can be settled at all. To begin by defining general
intelligence as hereditary is to beg a whole range of momentous
questions.

The foregoing may give the reader a fairly adequate idea
of attempts to reduce general intelligence to a formula. He
may very well feel that, so far as he is concerned, he
still does not know exactly what it is. If he undertakes to
make a more exhaustive study of the literature to which references
are made in the text and the bibliography, this impression is likely
to be strengthened. It is perfectly correct. That, indeed, is the
essential point. General intelligence is a loose and vague concept.
It is not the same idea as Thorndike’s “altitude of intellect,”
which means precisely the level of difficulty that can be attained
in a graded scale of intellectual tasks. Nor is it the same idea as
Spearman’s “general factor,” which, so far as testing is concerned,
means precisely the eduction of relationships and the eduction
of correlates. It was used by Binet as a loose, vague, but still
significant guiding hypothesis, as a warrant for assessing human
beings in terms of a comprehensive survey of their performance
on what would ordinarily be considered intellectual tasks. And it
has been so understood and used ever since. The ultimate defense
of this concept is not its theoretical clarity but its workability. It
shares the pervading vagueness of all holistic or configurationalist
psychology. But perhaps nothing more than vagueness is possible
in the present state of our psychological knowledge. Perhaps
attempts at too great precision are premature, and end only in the
production of triviality and falsehood. At any rate, the net
testimony furnished by the testing movement to date is that the
holistic point of view, indefinite though it may be in many respects,
actually pays out in appreciable practical success.


DESCRIPTIONS OF GENERAL INTELLIGENCE


The attempt to compress the concept of general intelligence
into a compact definition is no doubt attractive. But it is too
complex, too many-sided, too wide-ranging, and too vague for such
treatment to succeed very well. Accordingly, there have emerged
various attempts to describe it in a more comprehensive fashion.
And here indeed divergencies that are momentous, both
theoretically and practically, decisively appear.

Stoddard presents an elaborate descriptive analysis, made
under seven headings. “Intelligence is the ability to undertake
activities that are characterized by (1) difficulty, (2) complexity,
(3) abstractness, (4) economy, (5) adaptiveness to a goal, (6)
social value, and (7) the emergence of originals, and to maintain
such activities under conditions that demand a concentration of
energy and a resistance to emotional forces” (Stoddard, 19--,
p. 4).

Here is a comprehensive statement that merits careful attention
to all its details. Difficulty must not be thought of as an ability
to do unusual freak tasks, such as the feats of the lightning
calculator, or the defining of extraordinary words, or “information
please” memory stunts. It means the capacity to perform high-level
intellectual tasks such as those of higher mathematics, highly
organized artistic or literary production, strategy, administrative
decisions, and the like. Complexity is not a mere matter of
additive range. It refers to the ability to hold together many
considerations in a unitary effort, such as manifests itself in any
high-level skill or in complex research. Abstractness is the key
characteristic of all high-level mental operations. It means freedom
from the immediate, from trial-and-error processes, such as is
gained by the use of verbal and other symbols. It is well typified by
the mathematical sequence to greater abstractness from arithmetic,
through algebra, to calculus, and beyond. Thus the above three
characteristics all have to do with mental organization, and can
no doubt be quantified to some extent by existing psychometric
techniques.

Stoddard finds economy a better word than speed, for it means
moving towards a goal or performing a task without irrelevancies.
Adaptiveness has always been recognized as a characteristic of
intelligent behavior. The emergence of originals in working
methods and results is highly characteristic and indicative of superior
mentality, but is hardly recognized in psychometric tests.
Concentration of energy on a purpose is highly important. It must not,
however, be interpreted as a blind sticking to assigned tasks, but
rather as self-direction and persistence in significant endeavor.
Moreover, intelligence involves resistance to emotional blockages
and distractions, such as those coming from popular shibboleths,
advertising slogans and claims, prejudices that ignore reason,
self-distrust, and so forth. The mention of the social significance
of problems is noteworthy. It is characteristic of much modern
thought on the problem and nature of intelligence, the suggestion
being, as previously remarked, that intelligence manifests itself
better in the social planner than in the chess expert. Also the
choice of meaningful tasks and goals is regarded as part of the
total operations of what we call general intelligence.

Boynton (q.v.) offers a less analytical and extensive
characterization. (a) Intelligence is hereditary according to the first point
in his description. This, as we have seen, is open to fundamental
systematic objection. It ought not to appear in a general definition
or description, quite apart from its truth or error. (b) It involves
not only adaptation but reconstruction. Here again is the emphasis
upon original solutions and constructions, though differently
phrased, for reconstruction clearly indicates the discovery of new
solutions and the alteration of circumstances instead of the mere
adjustment to them. Boynton, like Stoddard, remarks that this
aspect of intelligence is not much recognized in tests. (c) The best
indication or criterion of intelligence is the behavior of the
individual in his group. Once more, from a somewhat different angle,
we have the emphasis on social significance in the tasks or
activities in which intelligence manifests itself. A child may appear
stupid in school or in test situations, but not so in a satisfying
social setting. (d) A characteristic manifestation of intelligence is
to look beyond the temporary and to envisage alternative group
needs. This, to repeat, is definitely a less complete characterization
than the foregoing. The emphasis on social significance and group
action as related to intelligence is noteworthy.

In sharp contrast with the descriptions just summarized is that
of Thorndike (Thorndike and Others, 1927), which has already
been briefly mentioned in these pages. For Thorndike, intelligence
has four attributes: level, range, area, and speed. (a) Level (or
altitude) refers to the difficulty of the tasks that can be performed.
As he puts it, if all the intellectual tasks in the universe were
arranged in order of difficulty, level would be shown by the hardest
of them that a person can do. It is the essential attribute of
intelligence, and nothing can substitute for or offset it. It cannot,
however, be measured entirely in isolation. (b) The range of
intelligence refers to the number of different tasks that can be
achieved at any given level. Again, slightly to paraphrase
Thorndike’s formulation, he puts it as follows. If all possible intellectual
tasks in the universe were rated for difficulty, all those on each
level would constitute range. He considers that a person should in
theory be able to do all conceivable tasks on his level, but
admits that practically this is limited by opportunities to learn.
For instance, a chess expert should be able to solve strategic or
administrative problems up to the difficulty level of the chess
situations he can handle. In intelligence tests, according to
Thorndike, range is represented by items of different kinds but of
equal difficulty, and range and level correlate almost perfectly.
This means the claim that competence and versatility go together
almost perfectly. Inborn intelligence is what determines level, but
it cannot be measured without introducing range, i.e., without
introducing tasks of more than one kind. (c) Area is simply the
summation of all ranges. It is not, for Thorndike, very important
for measurement. Clearly this is implied in his whole position, for
if a person can do all kinds of tasks on his top level of difficulty,
there is no point in giving him numerous diversified tasks that he
can do easily. He points out that a privileged and an
underprivileged child may both have the same level, because they are
identical in inborn ability, but they may differ in the area of tasks
they can compass because one is helped by wide training and
experience and the other is not. For him, the much mooted effect
of training in a good nursery school would be to increase area but
not level. (d) Speed is a significant aspect of intelligence, but less
closely bound up with the essential attribute of altitude or level
than are the other two.
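
The first three of Thorndike's attributes can be pictured with a small
computational sketch. The Python fragment below is only an illustration
under stated assumptions: the task records, the integer difficulty
ratings, and the function names are hypothetical and are not taken from
Thorndike or from any published scale. Level is read here as the highest
difficulty rating at which any task is passed, range as the number of
tasks passed at a given rating, and area as the sum of all ranges.

```python
from collections import Counter

# Hypothetical records: (task kind, difficulty rating, passed?).
# The ratings and outcomes are invented purely for illustration.
RESULTS = [
    ("completion", 1, True),
    ("arithmetic", 1, True),
    ("vocabulary", 2, True),
    ("directions", 2, True),
    ("arithmetic", 3, True),
    ("vocabulary", 3, False),
    ("completion", 4, False),
]


def ranges(results):
    """Number of tasks passed at each difficulty rating."""
    return Counter(difficulty for _, difficulty, passed in results if passed)


def level(results):
    """Highest difficulty rating at which at least one task is passed."""
    passed = ranges(results)
    return max(passed) if passed else 0


def area(results):
    """Sum of the ranges over all difficulty ratings."""
    return sum(ranges(results).values())


if __name__ == "__main__":
    print("level:", level(RESULTS))
    print("range by difficulty:", dict(ranges(RESULTS)))
    print("area:", area(RESULTS))
```

On this reading two persons with the same level may still differ in
area, which is the point of the comparison between the privileged and
the underprivileged child.
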
It is highly instructive to compare these three descriptions, for
they manifest divergencies that are both startling and revealing.

1. The clear meaning of Thorndike’s formulation is that the
kind of job on which intelligence operates, and in and through
which it is revealed, does not matter, except in so far as it is an
“intellectual” task. In particular social significance,
meaningfulness to the subject, interest, and so forth do not matter, at least
in theory. So far as intelligence is concerned, the difficulty of the
task is by all means the primary consideration. This, of course,
explains the building of the I.E.R. Intelligence Scale CAVD, which
consists of four lengthy series of four kinds of tasks: completions,
arithmetic problems, vocabulary, and directions.

2. Difficulty is thought of by Thorndike in terms of isolated
or isolable jobs that are ranked in order according to the
percentages of a standardization group by which they are passed. It
is very different from Stoddard’s conception, which associates
difficulty with complexity and abstractness as one phase of a unitary
process of mental organization. Also, Thorndike’s assumption is
that when a certain order of difficulty is established it will be
the same for all persons, again at least in theory. Of
course the idea of isolating all conceivable intellectual tasks and
ranking them in order of difficulty almost reduces one of the
common techniques of test construction to an absurdity. Test
items are very frequently ranked in difficulty in terms of the
performance of a standardization group. But to equate this with
a serial order of all possible tasks is a very formidable logical
leap.
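
The common technique referred to above, ranking items by the
percentage of a standardization group that passes them, can be
sketched briefly. The response table below is invented for
illustration, and the function names are assumptions rather than part
of any published procedure.

```python
# Hypothetical pass/fail records for a small standardization group.
# Each row is one person; each column is one item (1 = passed, 0 = failed).
RESPONSES = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
]
ITEM_NAMES = ["A", "B", "C", "D"]


def percent_passing(responses):
    """Percentage of the group passing each item, in column order."""
    n_persons = len(responses)
    n_items = len(responses[0])
    return [
        100.0 * sum(row[i] for row in responses) / n_persons
        for i in range(n_items)
    ]


def rank_by_difficulty(names, responses):
    """Item names ordered from easiest (most passes) to hardest."""
    pct = percent_passing(responses)
    order = sorted(range(len(names)), key=lambda i: -pct[i])
    return [names[i] for i in order]


if __name__ == "__main__":
    for name, pct in zip(ITEM_NAMES, percent_passing(RESPONSES)):
        print(f"item {name}: {pct:.0f} per cent passing")
    print("easiest to hardest:", rank_by_difficulty(ITEM_NAMES, RESPONSES))
```

Such an ordering describes only the performance of this one group on
these particular items, which is why treating it as a serial order of
all possible tasks is so much larger a claim.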


3. Consistently with the assumption that any kind of task is
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equally indicative of intelligence so long as it has a stated level 
of difficulty, Thorndike takes no cognizance of reconstructive 
activities or the emergence of originals. 

4. Speed apparently means for Thorndike the same thing with
many easy tasks, or in the attack upon exacting, complex,
disturbing and baffling problems. But surely there is an essential
difference. In the first case we have merely a process of rapid
enumeration, the doing of one thing after another at a given rate. In the
second situation we have choice among alternatives, recognition
of choices that would be hopeless, vital exploration, vital decisions,
an awareness of relevance. And here rapidity may be highly
important and highly indicative. In this latter situation, in other
words, intelligence is manifested by what Stoddard calls
“economy” (see also Kuhlmann, 1939).

5. Range and area are expressly characterized by Thorndike in
terms of the number of tasks of different kinds that occur and
can be done. Apparently painting with oils and with water colors
on the one hand, or painting with oils and fixing an automobile on
the other, would be pairs of tasks that would have the same value
and meaning with regard to range.

Thorndike’s description of intelligence is one of the best
instances of what critics have in mind in saying that testing is based
on a mechanistic and atomistic psychology that ignores the variety,
subtlety, and organic unity of the human mind. But it is by no
means the sole or necessary logical foundation of all psychometric
instruments which purport to deal with general intelligence. It is
an explicit general formulation of the ideas about which a certain
battery of tests was built, the I.E.R. Intelligence Scale CAVD.
McNemar (1942), for instance, expressly denies a cardinal point
in Thorndike’s doctrine, namely the very high correlation between
level and range. Put in other words this means that the kind of
tasks set up is important to a large extent independently of their
difficulty, so that a person’s mentality will be revealed better in
one kind of undertaking than another. This is why many scales,
including that of Binet and its major revisions, actually contain a
wide variety of subtests. Again Stoddard and Boynton set up a
concept of intelligence broader by far than that of Thorndike.
They emphasize much in intelligence that does not appear, at least
directly, in any tests. The reason why these factors are not
embodied in our instruments of measurement is not that they are
lacking in significance, or importance, or authenticity, but because
the means of translating them into valid, reliable, objective and
adequately standardized tests are not available. Nevertheless our
tests, although limited, really do contain much of the essence of
intelligence, as their successful application sufficiently shows. The
reason why I.E.R. Intelligence Scale CAVD is a serviceable
instrument, which in fact it is, is not because range correlates almost
perfectly with altitude so that in theory only difficulty matters,
but because it contains an array of tasks within the “range” of
many and indeed most of the human beings for whom the test is
intended, and because these tasks really do embody many
important characteristics of general intelligence.


EMPIRICAL CLARIFICATION OF CERTAIN ISSUES


Some of the basic issues in connection with the conception of
general intelligence have received a measure of empirical
clarification by experimental and statistical analysis. The
results are worth noting.


1. Intelligence and learning capacity
As will be seen from the definitions quoted above, or by reference
to the literature mentioned, general intelligence has quite often
been considered as identical with the capacity to learn. What
experimental evidence there is suggests quite strongly that any
such statements must be severely qualified.

Johnson (q.v.) gave 60 college students 10 minutes practice a day
on mirror reading. He obtained correlations between performance on
this task and scores on a number of standard intelligence tests.
Mean intelligence score correlated .34 ± .08 with the average
number of words read per diem, and .46 ± .07 with [...] 42 days.
He found in addition that those in the higher half in intelligence
improved in mirror reading more than those in the lower half.
Joseph Peterson (1922) found a moderate correlation between his
rational learning test, shown in Figure 3, and intelligence
tests.* Jordan (q.v.) administered four group intelligence tests
and also the Stanford Revision of the Binet Scale to 64 high
school pupils, and also an “ideational learning test.” This latter
test presented a series of pairs of letters at various distances
from each other in the alphabet, e.g., MO, BW, etc. The letters
were indicated by numbers corresponding to their serial position
in the alphabet. (In the above two instances, the pairs would be
indicated as 13 15, and 2 23.) The task was to learn to write the
letter midway between the two in each pair. Twelve practice
periods of 3 minutes each were given during an hour. The
correlations between intelligence test scores and this rather
complex piece of symbolic learning were “uniformly low” for the
various groups used. They centered around .20, and ranged from .31
to -.125. Smith (q.v.), working with a group of 95 subjects whose
average mental age was 9 years, gave them practice on spatial and
perceptual tests, and found the gains made were [...] related to
mental age. Grace McGeoch (q.v.) again has presented some evidence
that brighter individuals tend to benefit more than the duller by
using the whole method of learning. She worked with two groups of
children, 30 in each, from 9 to 10 years old. The mean
intelligence quotients of the two groups were 99 and [...]. Both
groups were set to learn series of Turkish-English vocabulary
pairs, and also ten-line poems. Three methods of learning were
used: the part method, the progressive part method, and the whole
method. The whole method was found superior for both groups, but
markedly so for the brighter. On the vocabulary learning the
progressive part method was superior to the pure part method for
the brighter subjects. McGeoch accounts for this by the fact that
the brighter subjects had a better grasp of [...] and better
mental organization on the job. Once again, J. A. McGeoch (q.v.),
working with children from 9 to 14 years of age, has found only a
slight relationship between intelligence and the ability to make a
correct report of objects and events that have been observed.

FIG. 3. RATIONAL LEARNING TEST (MENTAL MAZE)
(Peterson, J., 1922)

* In the rational learning test, or “mental maze,” the subject
does not look at the diagram. The examiner calls off [...] the
subject chooses [...] the goal at H. Whenever an error occurs, as
in choosing [...] instead of O, and so forth, he goes back to the
beginning and starts again with the pair N V. The test involves a
rather unique combination of rote learning and comprehension of a
general plan or layout.
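
The ± figures attached to Johnson's two correlations look like probable errors, a convention common in the period. That reading is an assumption, since the text does not say how they were obtained. The sketch below applies the classical formula, 0.6745(1 - r²)/√n, to the reported coefficients and sample size; the function name is illustrative only.

```python
import math

def probable_error(r: float, n: int) -> float:
    """Classical probable error of a Pearson correlation: 0.6745 * (1 - r^2) / sqrt(n)."""
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)

# Johnson's two coefficients, each based on 60 college students.
for r in (0.34, 0.46):
    print(f"r = {r:.2f}  probable error = {probable_error(r, 60):.2f}")
```

With n = 60 the formula gives about .08 and .07, which agrees with the printed values and lends some support to the assumption.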

This work is evidently none too conclusive. The learning tasks
are, for the most part, of a very limited and artificial type.
Smith, for instance, gives what she calls an “operational”
definition of learning as gain due to practice on a test. Such
notions hardly correspond to the broad conceptions of “adjustment”
and “power to acquire new skill and insight,” which those who
equate intelligence with the power to learn presumably have in
mind. Moreover, it seems likely that except in one and perhaps two
of the experiments, the total range of intelligence among the
subjects was rather narrow, since college and high school students
were largely used, the notable exception being the investigation
by Smith. This would tend to attenuate the obtained correlations,
which therefore might very well underestimate the true
relationship. But it is clear that there are serious difficulties
in the way of an out-and-out definition of general intelligence as
meaning the power to learn.
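
The point about a narrow range of intelligence attenuating the obtained correlations can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: both variables are taken to be normally distributed, the "true" correlation of .50 and the selection cutoff of one standard deviation above the mean are invented for illustration, and none of the figures come from the studies cited.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson product-moment correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate(true_r=0.5, n=20000, cutoff=1.0, seed=0):
    """Return (r in the full group, r among those scoring above the cutoff)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0, 1)  # stands in for the intelligence score
        # learning score built so that it correlates true_r with x
        y = true_r * x + math.sqrt(1 - true_r ** 2) * rng.gauss(0, 1)
        pairs.append((x, y))
    full = pearson(*zip(*pairs))
    # keep only the upper part of the range, as with a selected student group
    selected = [(x, y) for x, y in pairs if x > cutoff]
    restricted = pearson(*zip(*selected))
    return full, restricted

full_r, restricted_r = simulate()
print(f"full range: {full_r:.2f}   restricted range: {restricted_r:.2f}")
```

The restricted-range coefficient comes out well below the full-range one even though the underlying relationship is unchanged, which is the sense in which low correlations from selected groups may understate the true relationship.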

Woodrow (q.v.), in a broad survey of the evidence, makes the
following points. (a) There is little relationship between
practice gains in dealing with spot patterns, rearranging letters,
cancellations, making tallies, etc. (b) As far as school learning
is concerned, there seems to be no very decisive relationship
between learning in the separate subjects considered one by one
and general intelligence. (c) Little or no relationship has been
established between speed of learning and general intelligence,
though the evidence here is quite limited. (d) There may very well
be a relationship between certain mental factors and certain types
of learning.

Some years ago Pyle (1925, 1928), on the basis of a study of 50
kinds of learning, endeavored to show that a general learning
capacity exists, which he considered as equivalent to
retentiveness. Later results, however, as Woodrow points out,
throw much doubt upon the concept of general learning capacity.
Clearly, we cannot define intelligence as equivalent to anything
of the kind. What is indicated is a more analytic approach to the
problem, for on a priori grounds it would seem almost incredible
that there could be no close relationship of any kind between
mentality and learning.


2. Intelligence and personality type

Over the period of the past twenty-five years results indicating
a relationship between intelligence and type of personality have
from time to time been published. Although they have a crucial
bearing upon the concept of general intelligence itself, and also
upon the problem of translating it into suitable instruments of
measurement, they have received little notice in discussions of
these topics until very recently.

As long ago as 1920 Wells and Kelley (q.v.) demonstrated that
the various subtests of the Stanford Revision of the Binet Scale
elicit different responses in normal and in psychotic persons. For
both groups the vocabulary test has a high stability, and also the
subtests which call for immediate memory for digits. But there is
a marked difference in the tests at the ten-year level calling for
the drawing of designs from immediate memory and for reporting
on the thought content of a paragraph read by the subject, and
also on the Ball and Field Test which is placed at the twelve-year
level. On these tests Wells and Kelley found that functional
psychotics tend strongly to perform less well than normal subjects
of comparable mental ages.* Raymond Cattell (1945 b), again, in
his elaborate factorial studies of personality, finds intelligence
to figure as a general factor among traits, associated
particularly with character traits, and more specifically still,
with good habits.

A report by Piotrowski (q.v.) expands the topic. He finds that
most mental disorders have a selective effect upon Stanford-Binet
subtest responses. Many of the subtests are more difficult for
psychotics than for normals, and the effect differs with different
psychotic categories, i.e., schizophrenics, depressives,
paranoiacs, and so forth. The differences between these types of
persons appear as differences in the “profiles” they make on the
scale. This is a term that requires some explanation. The
Stanford-Binet, like all the direct revisions of the Binet scale,
is an age scale. That is, its subtests are grouped at stated age
levels. In measuring a subject, the examiner starts him at a point
where he should be able to succeed with all subtests, and
continues on to the level of failure. Before the upper limit is
reached he usually fails in certain subtests while still
succeeding with others. It is this pattern of successes and
failures that is referred to as the “profile.” A psychotic person
will have a special tendency to fail in certain subtests, and
these vary in accordance with the type of psychosis. Thus there
may appear a characteristic “psychotic profile,” and such may be
found in subjects who do not manifest a true functional psychosis.
When this happens, it is clear that there are two consequences.
First, the scale will underestimate the intelligence of the
subject as registered in his attained age. Second, it will be
possible to demonstrate a qualitative difference in his response
to situations demanding intelligent responses.

* The scale referred to is the first of two revisions made at
Stanford University under L. M. Terman. The second revision is
entitled the Revised Stanford-Binet Scale. The latter appeared in
1937. It is described on pp. 97-118 below and a partial synopsis
appears in Figure 5. For the Stanford Revision subtests mentioned,
see Terman (1916), where they are described in full. The subtest
called Ball and Field in the Stanford Revision is renamed Plan of
Search in the Revised Stanford-Binet.
Rapaport, Gill, and Schaefer (q.v.) have recently published very
extensive data on the same subject. Instead of one of the Stanford
revisions, they used the Wechsler-Bellevue Intelligence Scale.*
This is not an age scale but a point scale, made up of 11 subtests
each much more extensive than the Stanford-Binet subtests, and it
can be used for ages up to 60 years. The authors report that the
“scatter” of a subject’s showing, i.e., the pattern or
configuration of his weighted scores on the separate subtests, is
related to his personality type. The scatter pattern differs as
between normals and psychotics and as between different types of
psychotics. Psychotics, that is to say, find certain of the
subtests particularly difficult, and so make a poorer showing on
them than normals of presumably comparable mentality. Thus
schizophrenics (unclassified) have no special difficulty with the
information subtest, or with that involving digit span.
Arithmetic, however, is much impaired. There is little impairment
in similarities. Picture arrangement and picture completion are
greatly affected, block design only slightly, object assembly is
impaired to an extreme degree, and digit-symbol somewhat. In
general, the authors find such subjects definitely impaired in
language tests calling for comprehension, and markedly impaired in
performance tests involving visual organization. Similar findings
are reported for the categories of paranoiacs, preschizophrenics,
depressives, and neurotics, and their subclasses. The authors also
find that many normal individuals yield scatter patterns
resembling those of the psychotic types to which they correspond.
Much of this work is controversial, and the details of the
findings are subject to revision and correction. But the point
here is that it forbids us to regard intelligence as one uniform
function, identical for all persons and situations. It is by no
means a mere matter of altitude, or of graded difficulty of the
tasks an individual can accomplish. Its manifestations have within
themselves qualitative differences, and these are associated with
the emotional and personal make-up of the individual. Thus
Thorndike’s assumption that all tasks of a given range, i.e., of a
given level of difficulty, are really equivalent, and that the
only reason why a given person will not perform all such tasks
equally well is the “accident” of his training, experience, and
opportunity, cannot be maintained.

* This instrument is discussed on pp. 129-136 below, and a
synoptic outline is presented in Figure 10.

In the second place, these findings have a practical bearing
upon test construction. It will always be preferable to have an
instrument that can reveal these qualitative differences, which
may be considered true differences in intelligence, as clearly as
possible. Such an instrument will block out different types of
tasks clearly and unmistakably, and embody a wide range of such
types. The two Stanford Revisions fulfill the latter requirement
very well. They contain a large number of quite varying subtests.
But they do not fulfill the former as well as the
Wechsler-Bellevue scale, in which eleven major subtests are
blocked out. This, no doubt, is why Balinsky and Wechsler (q.v.)
were able to show that the Wechsler-Bellevue scale is superior as
a clinical and diagnostic instrument to the Stanford-Binet scale.
These criteria would also mean that the clinical value of the
I.E.R. Intelligence Scale CAVD is not great, for it contains four
types of verbal tasks which seem closely related on inspection,
and which have been demonstrated as such by Thorndike and his
associates.

To sum the matter up, an intelligence test may yield a single
or over-all total score, such as a mental age, or a percentile, or
a standard score, and this may be considered representative of the
subject’s level of intelligence. But such a score does not tell
the whole story. There are also qualitative differences in
intelligence, and if a test covers them up, it is to that extent
defective and insensitive to the objective facts.
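
The contrast between an over-all score and the qualitative pattern beneath it can be made concrete with a small sketch. Everything numerical here is hypothetical: the subtest names follow those mentioned in the discussion of the Wechsler-Bellevue scale, but the weighted scores are invented, and "scatter" is computed simply as each subtest's deviation from the person's own mean, which is only one of several ways such a pattern could be expressed.

```python
from statistics import mean

# Hypothetical weighted subtest scores; invented for illustration only.
person_a = {
    "information": 12, "digit span": 11, "arithmetic": 10,
    "similarities": 11, "picture arrangement": 10,
    "picture completion": 11, "block design": 10,
    "object assembly": 10, "digit symbol": 11,
}
person_b = {
    "information": 15, "digit span": 14, "arithmetic": 8,
    "similarities": 14, "picture arrangement": 8,
    "picture completion": 8, "block design": 13,
    "object assembly": 6, "digit symbol": 10,
}

def total(profile):
    """Over-all score: the plain sum of the weighted subtest scores."""
    return sum(profile.values())

def scatter(profile):
    """Deviation of each subtest from the person's own mean subtest score."""
    m = mean(profile.values())
    return {name: score - m for name, score in profile.items()}

for label, profile in (("A", person_a), ("B", person_b)):
    spread = max(abs(d) for d in scatter(profile).values())
    print(f"person {label}: total = {total(profile)}, largest deviation = {spread:.1f}")
```

Both hypothetical persons receive the same total, yet one profile is nearly flat and the other is sharply uneven; a report of the total alone would conceal exactly the kind of difference the text describes.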


3. Estimating intelligence 


Considerable light on the conception of general intelligence
comes from investigations of the problem of estimating
intelligence without the use of tests. It has been shown again and
again that unchecked estimates by teachers and others are
extremely untrustworthy. Several different estimates of the same
person do not agree well, and estimates of groups do not agree
with their showing on standard tests. Pintner has shown that when
teachers try to estimate the intelligence of children in school,
they are very much influenced by school performance, and that
their ratings are closely related to the grades of these children
(Pintner, 1931, pp. 294-95). Magson (q.v.) reports an attempt to
rate the intelligence of 876 persons on a 7 point scale, for the
purpose of assigning them to free places in the British secondary
schools. Ratings were made on the basis of an interview by judges
who were not acquainted with the candidates, and their
intercorrelations were about .15, which means virtually no
agreement or reliability. Once again, Webb (q.v.) reports a study
in which 104 students were rated for intelligence by teachers and
fellow students. Most of the ratings showed a low relationship to
results on an intelligence test, and the lowest relationship,
which was virtually zero, was found for the ratings made by men on
women. This finding is quite amusing, but it is also significant.
The difficulty in undirected attempts to rate intelligence is the
absence of criteria, and for this reason any irrelevant influence
may have a completely disturbing effect.
Ruch (1920) and also Varner (1922, 1923) have experimentally
examined the causes of this difficulty, and have explored the
possibility of overcoming them. They find first that a clear
working conception of intelligence must be set up, so that other
traits will be disregarded as far as possible, such a clear
conception usually being absent. Second, it is necessary, when
dealing with children, to consider the age of the person or
persons being rated. Otherwise there is no way of forming an idea
of what may properly be expected, and size and grade placement may
influence the estimate and very often do. Third, it has been found
that older children are easier to rate than younger ones, and that
dull children are easier to rate than bright ones. Dullness shows
itself far more distinctively and unavoidably than brightness,
which can easily be obscured by shyness. And also the unusual
character of a bright child’s reactions may not be taken into
consideration. Fourth, it is found that ratings within a grade are
more accurate than are attempts to rate children in general. This,
of course, is because the grade constitutes a frame of reference
on the general principle of a standardization group. Fifth, it
tends to improve ratings if a reasonable distribution of ratings
is imposed on those who do the work. Those who make such estimates
usually tend to rate very high. Two groups of teachers were
instructed to use the normal probability curve as a guide in
making their estimates, each group using a five point scale. One
group reported 63 children as A, 93 as B, 188 as C, 61 as D, and 6
as E in intelligence. The other group reported 28 as A, 54 as B,
36 as C, 12 as D, and 1 as E. So even when advice to use the
normal distribution is given, there is a strong impulse to
disregard it. And when no such advice is offered, the extreme
skewness of the results invalidates them.
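
How far the two sets of teacher ratings depart from a normal distribution can be checked with a short calculation. The sketch below is hedged in two respects: the category boundaries (half and one and a half standard deviations on either side of the mean) are an assumption chosen for illustration, since the text does not say which cut points the teachers were given, and the final count for the second group is read as 1.

```python
from statistics import NormalDist

# Assumed category boundaries in standard-deviation units (not from the source).
CUTS = (-1.5, -0.5, 0.5, 1.5)

def expected_shares(cuts=CUTS):
    """Share of a normal population falling in each of five bands, ordered E, D, C, B, A."""
    nd = NormalDist()
    edges = [0.0] + [nd.cdf(c) for c in cuts] + [1.0]
    return [hi - lo for lo, hi in zip(edges, edges[1:])]

def observed_shares(counts):
    """Convert raw counts to proportions of the group."""
    n = sum(counts)
    return [c / n for c in counts]

# Reported counts, ordered E, D, C, B, A.
group_one = [6, 61, 188, 93, 63]
group_two = [1, 12, 36, 54, 28]  # E count read as 1 (assumption)

labels = "EDCBA"
exp = expected_shares()
for name, counts in (("group one", group_one), ("group two", group_two)):
    obs = observed_shares(counts)
    row = "  ".join(f"{l}: {o:.2f} (exp {e:.2f})" for l, o, e in zip(labels, obs, exp))
    print(f"{name}: {row}")
```

Under these assumed boundaries both groups place far more children in the top category and far fewer in the bottom one than the normal curve would call for, which is the upward skew the passage describes.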

The significant outcome of these investigations is not merely
that it is possible, even without the use of tests, to make a
fairly accurate and valid estimate of a person’s general
intelligence. The more important consideration is how such an
estimate must be made. It must be a properly oriented survey that
confines itself, as far as may be, to those aspects of a person’s
behavior which reveal intellectual capacity and which involve
intellectual tasks. Conclusions regarding a person’s intelligence
must be drawn from the difference in this respect between him and
others, with such disturbing factors as his age or the personal
impression he makes properly discounted. And he must be rated with
reference to some feasible and reasonable notion of the
distribution of general intelligence, this usually involving the
assumption that such a distribution will be approximately normal.


CONCLUSION 


Taking into consideration all the varied lines of thought and
investigation which have been summarized in this chapter, it is
clear enough what the concept of general intelligence, as it has
arisen in modern psychology and is employed in modern testing,
means. It does not stand for some monolithic, unitary, sharply
defined mental entity. It cannot be equated to “altitude” of
intellect alone, although the ability to succeed with increasingly
difficult tasks is one of its best indications. Nor is it by any
means equivalent to learning capacity, at least in any very
precise or specific sense. When we speak of a person’s general
intelligence we mean a congeries of mental functions. These it is
possible to survey and appraise, and the result can be expressed
in an over-all or global score or rating. But such a score, though
meaningful and useful, can never tell the whole story, because
there will be qualitative differences that it will not reveal.
That is, two persons with the same over-all rating may and
probably will show different patterns of strength and weakness in
connection with tasks of different kinds. These differences are in
part no doubt due to training, experience, and opportunity, but by
no means wholly so. And they must be considered true differences
in intelligence. General intelligence, moreover, has many aspects
or components which cannot be tested, not because they are not
important and also not because they are intrinsically inaccessible
to any kind of measurement, but simply because the technical means
do not exist. Yet important aspects of general intelligence can be
measured reasonably well, and presumably as tests improve more and
more aspects of it will become accessible. Thus the latest tests,
such as the Wechsler-Bellevue scale, or the Kuhlmann Tests of
Mental Development, seem to uncover more than the earlier and less
evolved instruments, such as the direct revisions of the Binet
scale. So it does not seem at all unreasonable to hope for still
further progress, the general direction of which seems fairly
manifest: towards tests which will differentiate better and reveal
qualitative differences more clearly and certainly.


“Intelligence,” as Sherman puts it (p. 8), “obviously is not a
single mental process, but a practical concept connoting a group
of complex mental processes.” This is how it has been conceived
from the time of Binet to the present day. Its significance in
theory lies in the reaction it involves away from a faculty
psychology and towards a holistic or configurationalist
psychology. Its significance in practice lies in the emphasis it
implies upon comprehensive survey-like ratings of mentality,
rather than upon very narrow specific functions. In spite of its
vagueness it has proved exceedingly fruitful.

However, in the light of this account, the [...] contemporary
movement towards a more analytic [...] organization should be very
evident. The attempt is to get away from the admitted vagueness of
the notion of general intelligence, to define and isolate definite
mental factors, such as inductive thinking, grasp of spatial
relationships, [...], and the like, and to build tests which will
measure them. There is a trend towards tests which will yield
profiles on mental factors in place of the “global” score on
general intelligence. The reasons for this development are
abundantly clear, but its outcomes are not yet fully established.


QUESTIONS FOR DISCUSSION


1. Does the fact that general intelligence is loosely defined
invalidate it as a concept? If so, consider some other
psychological concepts that would be invalidated.

2. Does the fact that the concept of general intelligence arose in
a practical setting and has been put successfully to practical
uses serve to validate it theoretically?

3. To what extent does it seem possible to reconcile the
definitions of general intelligence that have been quoted? To what
extent do they conflict?

4. What inferences in regard to general intelligence might be
drawn from the varying treatments indicated in this chapter, and
also those appearing in “Intelligence and its measurement: a
symposium” (q.v.)?

5. Consider some of the practical meanings for the guidance of a
human being of Thorndike’s claim that range correlates perfectly
with altitude.

6. Which of the aspects of general intelligence mentioned by
Stoddard seem to lend themselves peculiarly to measurement? Which
do not? In considering this question, refer to the discussion in
Chapter I of this book on the limitations of tests.

7. Boynton and Stoddard both insist on the importance of socially
significant tasks in revealing intelligence. Could this emphasis
be translated into instruments of measurement?

8. In what respects is the description of general intelligence
offered by Thorndike “atomistic”?

9. Do the findings on the relationship between intelligence and
learning invalidate any of the definitions cited in this chapter
or found elsewhere?

10. Cite and discuss instances from your own experience of
qualitative differences, i.e., differences in the kind of tasks
that can be done, between persons more or less on the same level
of general intelligence.

11. Could artistic creation be considered as a type of intelligent
activity?

12. Applying so far as you can the safeguards mentioned in the
chapter, make estimates of the intelligence of persons known to
you, and if possible compare those ratings with their tested
intelligence.


CHAPTER IV


SCALES FOR THE MEASUREMENT OF
INTELLIGENCE

THE MEANING AND IMPORTANCE OF INTELLIGENCE SCALES

This chapter deals with the most important scales for measuring
intelligence. The word “scale,” [...] measurement, is not at all
precise [...] can be applied to a wide range [...] years. But
there are exceptions, as for instance the California First Year
Scale, intended [...]

[...] to be considered are very important for two reasons. First,
they have great practical value, and are [...] psychometric
instruments. Second, [...] from the original work of Binet,
through [...] extension, its extensive modification, and [...]
culminates in the appearance of definite new orientations.

THE WORK OF BINET


As was pointed out in the previous chapter, the work of Binet
in the field of psychometrics stemmed in general from a lifelong
interest in the problem and specifically from the assignment to
him of a major practical task. His first attempt to translate into
an instrument of measurement the idea of general intelligence that
had been growing in his mind for at least ten years is shown in
[...] in outline, his earliest syllabus of mental tests. This
appeared in 1905, but it was not until three years later that all
the major characteristics of his method appeared. These were
embodied in the 1908 scale, partly summarized in Figure 4. Simply
put, it consists of a conglomerate of tests classified into age
levels. Such was the first systematic translation of the
conception of general intelligence into the items and layout of a
comprehensive instrument of measurement.

Before his death, Binet made one further revision of his tests,
which appeared in 1911. Various subtests were added. Some were
eliminated. Others were shifted to new age classifications.
Although Binet died with his work incomplete, the ideas and
methods he elaborated and the resulting scales were adopted and
revised in many lands. The account that follows will be confined
to the American work, which has been very fruitful and thorough.
To repeat, it forms a coherent story. The expansion, revision, and
partial supersession of the ideas and practices originating with
Binet do not follow a strict chronological order, to be sure. Some
very far-reaching departures were at least suggested and to some
extent put into effect very soon. But the whole great body of
work, extending over more than thirty years and enlisting the
efforts of some of the ablest men in the field, is a logical
development which constitutes, for better or worse, the very core
of modern psychometrics.
AGE III
1. Pointing to nose, eyes, and mouth.
2. Repetition of short sentences.
3. Repetition of two digits.
4. Enumeration of objects in pictures.
5. Knows last name.

AGE V
1. Compares two boxes of different weights.
2. Copies square.
3. Rectangular card cut diagonally to be reconstructed according
   to a similar uncut card.
4. Counts four coins.
5. Repeats ten-syllable sentence.

AGE VII
1. Tells what is missing from unfinished pictures.
2. Knows number of fingers on one and both hands without
   counting.
3. Copy of written model.
4. Copies diamond.
5. Repetition of five digits.
6. Description of pictures.
7. Counts thirteen coins.
8. Knows names of four common coins.

AGE X
1. Repeats months of the year.
2. Knows names of nine pieces of money.
3. Uses three given words in one sentence.
4. Comprehension of easy questions.
5. Comprehension of difficult questions.

AGE XIII
1. Paper folding and cutting (as in 1905 tests).
2. Rearranges two triangles in imagination and draws result.
3. Differences between pairs of abstract terms.

FIG. 4. EXCERPTS FROM FIRST BINET SCALE (1908)
(After Pintner, 1931, pp. 137-40)


EXTENSION OF BINET’S WORK: THE STANFORD REVISIONS


1. Stanford Revision of the Binet Scale for the Measurement
   of Intelligence *
2. Revised Stanford-Binet Tests of Intelligence †


These two revisions, published in 1916 and 1937, were not the
first to appear in this country, but they are major landmarks in
mental testing. The work had ample financial backing and is an
outstanding model of test construction. It is definitely an
extension of Binet’s ideas, rather than a departure from them. Our
concern will be with the second revision, and reference to the
first will be chiefly for the sake of background.

* References: Terman, 1916; Terman et al., 1917.
† References: Terman and Merrill, 1937 a, 1937 b; McNemar, 1942.


I. Characteristics

A. General characteristics. Both the Stanford Revisions are
instruments for individual administration, not group scales or
tests. The first revision contained 90 subtests organized into a
single form. The second revision contains 129 subtests in each of
its two forms equivalent in difficulty, designated as Form L and
Form M. Most of the subtests from the first revision are retained
in the second, although there have been some changes and
eliminations. For example, there were items in the old Absurdities
Test, located at age 10, which were found to be emotionally
disturbing (e.g., Yesterday the police found the body of a girl
cut into eighteen pieces. It is believed she killed herself.), and
these have been changed. The new scale covers a wider range of
ages than the earlier revision. The old scale ran from age 3 to
14, and then through “Average Adult” and “Superior Adult.” The new
scale runs from age 2 through 14, and then through four adult
levels; namely, Average Adult, Superior Adult I, Superior Adult
II, Superior Adult III. The old scale grouped its subtests in
1-year steps from 3 through 10, then presented groupings for ages
12 and 14, and then the two adult classifications. The new scale
has groupings at 6-month intervals from ages 2 through 5,
groupings at 1-year intervals from ages 6 through 14, and then the
four adult levels. Thus it offers an important extension both of
content and age application.
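
The two level structures just described can be laid out side by side in a few lines. This is only a sketch of the description given above, with two assumptions labeled in the comments: half-year levels are written in a made-up "year-6" notation, and the half-year series is taken to end at age 5 itself rather than at five and a half.

```python
def old_scale_levels():
    """Levels of the 1916 revision as described: yearly 3-10, then 12, 14, two adult levels."""
    years = [str(y) for y in range(3, 11)] + ["12", "14"]
    return years + ["Average Adult", "Superior Adult"]

def new_scale_levels():
    """Levels of the 1937 revision as described in the text."""
    half_year = []
    for y in range(2, 5):
        # "y-6" is a hypothetical label for the level six months past year y
        half_year += [str(y), f"{y}-6"]
    half_year.append("5")  # assumption: the half-year series stops at 5
    yearly = [str(y) for y in range(6, 15)]
    adult = ["Average Adult", "Superior Adult I",
             "Superior Adult II", "Superior Adult III"]
    return half_year + yearly + adult

print(len(old_scale_levels()), "levels in the old scale")
print(len(new_scale_levels()), "levels in the new scale")
```

On this reading the old scale has 12 levels and the new one 20, with the added levels falling at the two ends of the range, which is where the text locates the extension.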

B. Items and subtests. For both the earlier and later revisions
very thorough canvasses for good test items were made. All items
already in use in intelligence tests were collated and studied and
additions were suggested. The items were critically scrutinized,
and those that passed the primary selection were tried out on a
standardization group of 3184 persons. In this work tests were
given in 17 communities located in 11 states, representing the
East, the South, the Midwest, and the West. Rural, urban, and
occupational groups of wide variety were represented in the
standardization. Urban representation was somewhat
disproportionately high, and compensation for this was made at a
later stage in the process by setting the median intelligence
quotients for the age groups at slightly over 100. Occupational
representation was held proportional to the distribution of
occupational populations in the country as a whole as shown in the
United States census. Only American-born white children were
included in the standardization. This was


YEAR II
(6 tests count 1 month each or 4 tests count 1½ months each)

1. Place three small blocks into similar holes in a board.*
2. Point to toys when their names are given.
3. Point to parts of a large paper doll when parts are named.*
4. Build a four-cube tower after demonstration.
5. Name common objects shown in separate pictures.*
6. Use a two-word sentence spontaneously (e.g., See kitty).*

(Alternate: Obey simple commands to manipulate small toys.)


YEAR III
(6 tests count 1 month each or 4 tests count 1½ months each)

1. Obey simple commands to manipulate small toys.*
2. Name common objects from separate pictures.*
3. Point to the longer of two sticks.
4. Name at least three objects shown in one picture.
5. Point to objects to indicate use (e.g., Show me which one we
   drink out of).*
6. Tell what to do in common situations.*

(Alternate: Draw a cross with a pencil after demonstration.)


YEAR VI
(6 tests count 2 months each or 4 tests count 3 months each)

1. Define five words orally by description, use, or
   classification.*
2. Make a simple bead-chain pattern from memory after
   demonstration.*
3. Tell what part is missing from pictured objects.
4. Select certain numbers of blocks from a pile.*
5. Point to one of five pictured objects which is different from
   the rest.*
6. Draw a pencil line through a simple maze to make the shortest
   path.


YEAR X
(6 tests count 2 months each or 4 tests count 3 months each)

1. Define eleven words orally.*
2. Explain why the pictured actions of a person are foolish.
3. Read a passage of 48 words and recall from memory a
   considerable part of it.
4. Give two reasons in support of an oral statement.*
5. Name as many disconnected words as possible in a minute.*
6. Repeat six digits after one oral presentation.*


AVERAGE ADULT
(8 tests count 2 months each or 4 tests count 4 months each)

1. Define twenty words orally.*
2. Transcribe a short passage in a code which is exposed.*
3. Give differences between two abstract words.*
4. Read short arithmetic problems and answer without using paper
   and pencil.
5. Tell what proverbs mean in own language.
6. Give oral solution of a practical mechanical problem presented
   orally.*
7. After one oral presentation repeat 24-syllable sentence without 


error, 
8. Tell in what way pairs of words are alike. 


SUPERIOR ADULT 


(6 tests count 6 months each or 4 tests count 9 months each) 


1. Define 30 words orally.* 
2. Read aloud a problem concerning direction and distance traveled 


and give answers without using paper and pencil. 


3. Give opposites of words.* 
4. Watch examiner fold and cut a piece of paper, then make a pencil 


drawing to show how paper would look unfolded.* 
5. Read silently while examiner reads aloud a simple geometric pro- 
gression problem, then give answer without using paper and 


pencil.* 
6. Repeat 9 digits after one oral presentation. 


* The asterisks indicate tests to be used as abbreviated scale. 
A st ed oe ee, 


Fic. 5, SAMPLE OF THE REVISED STANFORD-BINET Tests oF 
INTELLIGENCE 


(Adapted from Terman and Merrill. 1937 b after E. B. Greene) 


kA 
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procedure of extraordinary thoroughness. To give some idea of 
what was involved, the final selection of items required choice from 
30,000 cards used for tabulating data on testing run in the various 
communities. It is interesting to compare this with the standardi- 
zation of the former revision made in 1916. Here the work was 
done with a group of about 1000 children which included all those 
of suitable ages attending a school in a typical middle-class com- 
munity, the school being the only one in the community and en- 
rolling all the children. This earlier standardization has been 
criticized as somewhat too high, because the group was drawn 
from a community in California, a state in which mean regional 
intelligence is above that of the nation as a whole. Such a choice 
would have the effect of making the norms of the Stanford Revi- 
sion somewhat unduly exacting, and of making obtained mental 
ages and intelligence quotients unduly low. 

The probable validity of the items finally selected for the 
measurement of intelligence was determined by the use of a variety of 
criteria (Terman and Merrill, 1937 b; McNemar, 1942). (a) 
First there was the general opinion as to their worth formed by 
the corps of expert workers who constructed the scale. An im- 
mense variety of items were discussed and analyzed, and only the 
best survived this preliminary selection. (b) A second criterion 
was the increase in the percentages of children who succeeded 
with each item at increasing ages. Only those were retained which 
showed a rising gradient of successes at older ages. (c) A specially 
devised discrimination quotient was worked out and applied, which 
was based on the differences of the ages of those who passed and 
those who failed each subtest. (d) The correlations of each subtest 
with the composite total scores on the two forms L and M were 
used as a selective criterion. Subtests that did not show a 
satisfactorily high correlation were rejected. This, clearly, would tend 
to make the scale a homogeneous instrument. 
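
Criteria (b) and (d) lend themselves to a small numerical illustration. 
The sketch below is not the procedure of the scale's authors. It assumes 
pass/fail records for each child, uses made-up data, and takes an ordinary 
Pearson correlation between a 0/1 item score and the total score as a 
stand-in for the correlations mentioned under (d). All function names are 
hypothetical. 

```python
from statistics import mean, pstdev


def percent_passing_by_age(records):
    """Criterion (b): percentage of children passing an item at each age.

    `records` is a list of (age, passed) pairs; this layout is an assumption.
    """
    by_age = {}
    for age, passed in records:
        by_age.setdefault(age, []).append(1 if passed else 0)
    return {age: 100.0 * mean(flags) for age, flags in sorted(by_age.items())}


def rises_with_age(percent_by_age):
    """True when the percent passing increases at every successive age."""
    values = [percent_by_age[age] for age in sorted(percent_by_age)]
    return all(later > earlier for earlier, later in zip(values, values[1:]))


def item_total_correlation(item_scores, total_scores):
    """Criterion (d), illustrated with a Pearson correlation (an assumption)."""
    mx, my = mean(item_scores), mean(total_scores)
    sx, sy = pstdev(item_scores), pstdev(total_scores)
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    cov = mean([(x - mx) * (y - my) for x, y in zip(item_scores, total_scores)])
    return cov / (sx * sy)


# Made-up records for a single item at two ages.
records = [(4, False), (4, True), (5, True), (5, True)]
gradient = percent_passing_by_age(records)
print(gradient, rises_with_age(gradient))
print(round(item_total_correlation([0, 0, 1, 1], [1, 2, 3, 4]), 3))
```

An item would be retained under (b) only if `rises_with_age` holds, and 
under (d) only if its correlation with the total is judged high enough; 
the threshold is left open here because the text does not state one. 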

As both Terman and McNemar point out, this account of the 
processes of item selection is an effective answer to critics of the 
scale who claim that the only criterion used was the increasing 
percentages passing on each item at successive advancing ages. It is 
hard to see how anything better could well be done to secure a 
preliminary validation of the items to go into any test. The crucial 
point, of course, is the judgment and choice of the workers who 
made the test. They were governed by a certain working conception 
of general intelligence and its manifestations; and if this 
were at fault, everything would fail. For the other criteria are 
internal and turn on the logic of the instrument itself, always 
depending on the basic assumption that it really is a valid means 
of revealing intelligence. 

There were in addition certain secondary considerations which 
were kept in mind in the selection of items and subtests. The work- 
ers desired to construct a scale that would be as easy as possible 
to score. Other things being equal, they chose items which were 
interesting to children and which were of varied type. Economy of 
time, too, was a point considered. It was desired to keep the time 
limit for testing within 75 minutes for the older subjects and 
within 50 minutes for the younger ones. 

As to the type of items which emerged finally from this extended 
selective process, some idea may be formed from the partial synoptic 
outline in Figure 5, but if possible the reader should examine 
the scale itself. Many of the subtests involve the use of pictures, 
objects, the ability to indicate parts of the body, and so forth. 
These nonverbal and performance subtests occur with particular 
frequency at the earlier ages. At the higher levels there is a larger 
proportion involving abstract verbal and numerical processes. 
Immediate memory for words is a type of subtest which appears 
at many places in the scale with varying degrees of difficulty, of 
course. The use of performance items meets to some extent the 
criticism which was made regarding the earlier revision, to the 
effect that it was unduly verbalistic. Indeed some critics seem to 
feel that the present scale has gone too far in the opposite direction. 

C. Scaling and standardization. As has already been pointed 
out, the Revised Stanford-Binet scale is an age scale. That is to 
say, the tests are grouped in terms of age levels. In this respect 
Terman has continued to follow the pattern set by Binet as long 
ago as 1908. There has been considerable criticism of this decision 
on the ground that improved modes of test construction which 
have been developed in recent years were ignored. We shall take up 
these objections later on. 

The procedure in assigning the subtests to the proper age 
classifications was as follows. The standardization group already 
described in general yielded about 100 subjects at each half-year 
interval from ages II through V, and about 200 subjects at each 
year interval at ages VI through XIV. It has already been pointed 
out that the term mental age may have two different meanings. 
It may mean the chronological age at which a given score is the 
mean score, or it may mean the average chronological age of all 
those making a given score. Putting it otherwise, one may take 
the scores on the test as the base line and refer the chronological 
age values to them, or one may take the chronological ages as the 
base line and refer the scores to them. In statistical terminology 
this again means that mental age may represent the regression of 
age on score, or the regression of score on age (v. Thurstone, 
1921; Otis, 1916). The choice of the workers in this case was to 
base mental age on the chronological age of those achieving a 
given test performance.* That is to say, they fitted test perform- 
ances to the chronological ages of those who made them. 
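
The difference between the two meanings of mental age can be seen in a 
small computation. The sketch below is only an illustration on made-up 
data. It assumes straight-line regressions fitted by least squares, which 
the text does not specify, and its function names are hypothetical. 

```python
from statistics import mean


def slope_intercept(x, y):
    """Least-squares line y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def age_at_which_score_is_mean(score, ages, scores):
    """First meaning: regress score on age, then find the age whose
    predicted mean score equals `score`."""
    slope, intercept = slope_intercept(ages, scores)
    return (score - intercept) / slope


def mean_age_of_those_with_score(score, ages, scores):
    """Second meaning (the one the text says was chosen): regress age on
    score and read off the expected age for `score`."""
    slope, intercept = slope_intercept(scores, ages)
    return intercept + slope * score


# Made-up standardization data: two children at each of three ages.
ages = [6, 6, 8, 8, 10, 10]
scores = [10, 14, 16, 20, 22, 26]

print(age_at_which_score_is_mean(24, ages, scores))
print(mean_age_of_those_with_score(24, ages, scores))
```

On these data the two functions return different ages for the same score. 
They agree only when score and age are perfectly correlated, for example 
with ages `[6, 8, 10]` and scores `[12, 18, 24]`. 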

This might have been done simply by assigning each subtest 
to the age level at which it was passed by 50% of the age group 
of standardization subjects. For instance, Subtest 3 at Age X re- 
quires the subject to read a passage of 48 words and then to recall 
a considerable amount of it. This might have been standardized 
at the ten-year-old level because approximately half of the 200 
children in the standardization group who were in the ten-year- 
old category were able to do it. The procedure has been used else- 
where in test construction, but it was not the one employed here. 
Actually this one was an elaborate cut-and-try, trial-and-error 
job of manipulation in which subtests were experimentally shifted 
from one age level to another until finally in each case the median 
mental age of each age group would correspond to its median 
chronological age. 

As will be seen from Table 9, the actual percentages of the 
different age groups passing the subtests for each age level in the 
final layout are not constant. They range from 77% of the four 
year olds to 37.4% of superior adults. The workers who constructed 
the scale had a number of reasons for arranging the subtests 
in what might seem a complicated and indirect manner. 
Decidedly their most important reason, however, and the one of 
interest here, was their belief that the real value of increases in 
mental age diminishes as age advances. That is, they thought of 
the process of mental growth as going very rapidly early in life 
and then slowing up, so that the true difference between a mental 
age of 13 and one of 14 would be less than that between 4 and 5. 

* The two methods of determining a mental age will give different values 
except under one particular condition, i.e., when there is a perfect 
correlation between test scores and chronological age. 


TABLE 9 

PERCENTS PASSING FOR VARIOUS AGE GROUPS, STANFORD REVISION OF 
THE BINET SCALE 

(Quoted from Terman, et al., 1917, Table 43, p. 158) 

Year Group            Average Percent Passing 

IV                    77.0 
V                     71.3 
VI                    70.8 
VII                   68.0 
VIII                  63.2 
IX                    62.3 
X                     64.5 
XII                   62.4 
XIV                   55.6 
Average Adult         59.8 
Superior Adult        37.4 


[...] old and 13 years old, a smaller percentage of the 13 year olds 
must pass them than of the 4 year olds, because the “ceiling,” or 
the next classification, is much closer (again v. McNemar, 1942). 

In actual practice, the tests were manipulated and fitted together 
into the age scale so that the median intelligence quotient for each 
age group would come out at just over 100, this being done to 
compensate for the inadequate sampling of rural children in the 
standardization group. It should be noted here that this did not 
mean a manipulation of the subtests in such a way that the obtained 
intelligence quotient of any child would necessarily be constant. 
This criticism has been made from time to time, but it is 
based on a misunderstanding. The purpose of the procedure was 
to insure that the median intelligence quotient of approximately 
100 should always have the same meaning at all age levels in 
terms of test performance. This is absolutely necessary if the 
intelligence quotient is to be used at all, for it cannot stand for 
one set of relationships and performances at one level and some- 
thing different elsewhere. But whether the actual intelligence quo- 
tient of any child will vary or remain constant remains an open 
question. When a foot rule, which is a physical scale, is set up so 
that every inch is equal to every other inch, this does not mean 
that every object measured will always keep the same length under 
all circumstances, but merely that the measurements themselves 
will always have a constant meaning. 

D. Administration and scoring.* The indicated way of using the 
scale is to begin well below the probable mental level of the child 
with whom one is dealing. Specifically the testing should start low 
enough on the scale so that the subject will achieve 6 consecutive 
successes on the subtests, and it should be continued not to the 
point of the first failure, but until 6 consecutive failures have 
occurred. Thus the child does not earn a given rating by nothing 
but the tests at the given age level but by certain successes at that 
level, plus successes in preceding years, minus certain failures 
above the given age level. A clear grasp of this scoring procedure 
will resolve certain not infrequent misunderstandings in regard 
to mental age, which is often supposed to mean a very clear-cut 
success with all tests on a given age level, with no failures at or 
below it and no successes above. It is the performance within this 
range from the starting point to the highest point achieved that 
provides the profiles that were discussed in the last chapter. 
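
How successes above the starting point accumulate into a mental age can 
be sketched in a few lines. The example assumes a basal-plus-credits 
reading: everything below a basal level is credited in full, and each 
test passed above it adds the month value attached to its age level (as 
in the "tests count N months each" notes of Figure 5). The text does not 
state the rule in exactly this form, so the rule, the function names, and 
the sample record are all assumptions, and only whole-month credits are 
handled. 

```python
def mental_age_months(basal_age_months, passes_above_basal):
    """Add month credits for tests passed above the basal level.

    `passes_above_basal` is a list of (tests_passed, months_per_test)
    pairs, one pair per age level above the basal level.
    """
    credit = sum(passed * months for passed, months in passes_above_basal)
    return basal_age_months + credit


def format_mental_age(months):
    """Render whole months in the years-months symbolism, e.g. 8-5."""
    years, remainder = divmod(months, 12)
    return f"{years}-{remainder}"


# Hypothetical record: basal level at 6 years (72 months), then 4, 2 and 1
# tests passed at the next three levels, each test worth 2 months.
total = mental_age_months(72, [(4, 2), (2, 2), (1, 2)])
print(total, format_mental_age(total))
```

The point of the sketch is that the rating is a sum over a range of 
levels, so two children with the same mental age may have passed quite 
different sets of tests. 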

The scale yields two chief types of scores. (a) First there is the 
mental age, expressed in years and months, and commonly reported 
in the symbolism 8-5 (eight years and five months), etc. 
This is an indication of mental maturity. Readiness for first grade 
entry, for example, would be considered to depend on the child’s 
mental age in the first instance, rather than on his intelligence 
quotient, in so far as it is determined by measurable mentality. 
(b) Then there is the intelligence quotient, the basic idea of which 
is that age must be considered if one wishes to form an opinion of 
a child’s brightness. This is a very common-sense notion, as every 
parent who has boasted of the wonders his young hopeful can perform 
at a very early age should readily understand. All that is 
done by the intelligence quotient is to transform such evaluations 
into numerical scores by dividing the child’s mental age by his 
chronological age. 

* See Terman and Merrill, 1937 b. Terman and Merrill, 1937 a, is a more 
concise account of administrative and scoring practice. 


A special problem arises at the upper levels in connection with 
[...] regularly beyond a certain age level. Thus there seems to be a 
point, or age of arrest, beyond which mental age does not increase 
although of course chronological age increases just as before. Since 
a mental age up to 14 means specifically the representative mental 
[...] the intelligence quotients of all persons of 16 and over were 
calculated on this figure as the denominator. This practice was 
admittedly somewhat arbitrary and open to a number of questions, 
and in the revised scale it has been considerably modified. In 
computing ages for the sake of working out intelligence quotients, all 
chronological ages from 13 to 16 are scaled down by ⅓ of the 
excess over 13. Thus a chronological age of 14 years is figured 
at 164 months instead of its normal full value. This is arrived at 
by taking 156 months, which is the full value for 13 years, adding 
12 months for the additional year, and subtracting ⅓ of 12 according 
to formula, which is 4 months, giving a total of 164 months. 
Similarly a chronological age of 15 years is figured at 172 months. 
For [...] intelligence quotients, i.e., those of persons over the age 
of 16, the base line or denominator [...] compromise solution, and in 
a measure an arbitrary one, admittedly so, although there are 
arguments in favor of it. What the reader should understand is that 
[...] failure of test performance to [...] with age beyond a certain 
level probably are not basic [...] of mental [...] but seem due to the 
particular selection of subtests and items in the Stanford-Binet scale, 
which do not permit regular advance beyond a certain point. Other 
tests, e.g., the Wechsler-Bellevue, have shown a much longer advance. 
This does not invalidate the Stanford-Binet, but limits its 
usefulness and applicability. 
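
The quotient and the adjustment of the denominator at the upper ages can 
be written out as a short calculation. This is a sketch under stated 
assumptions: the quotient is taken as 100 times mental age over 
chronological age, ages between 13 and 16 lose one third of their excess 
over 13, and anyone older than 16 is treated as 16 (so that the 
denominator never exceeds 15 years). The last rule is inferred from the 
surrounding discussion, and the function names are hypothetical. 

```python
def adjusted_ca_months(ca_months):
    """Chronological age in months as used for the denominator.

    Up to 13 years (156 months) the age is used at full value. Beyond that
    the excess over 156 months is reduced by one third, and ages above
    16 years (192 months) are treated as 16 (an assumption, see lead-in).
    """
    if ca_months <= 156:
        return float(ca_months)
    capped = min(ca_months, 192)
    return 156 + (capped - 156) * 2 / 3


def intelligence_quotient(ma_months, ca_months):
    """100 * mental age / adjusted chronological age."""
    return 100 * ma_months / adjusted_ca_months(ca_months)


for years in (13, 14, 15, 16, 30):
    print(years, adjusted_ca_months(years * 12))
```

Run as written, the loop reproduces the figures quoted in the text for 
14 and 15 years (164 and 172 months) and gives 180 months, or 15 years, 
for every age from 16 upward. 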
2. Special Problems: Criticisms: Appraisals 


The Stanford-Binet scale has such a focal position that a just 
appraisal of it is of great importance. It has constantly been used 
as a reference point in the construction and validation of other 
psychometric instruments. The methods used in its construction 
have been widely copied. And there is hardly a question connected 
with mental testing and its outcome upon which the results and 
principles of the Binet scale do not have some bearing. 

A. Problems connected with scoring. (a) Both the M.A. and 
the I.Q. have double meanings. We have seen that mental age is 
conceived as the representative test performance of a given stand- 
ardization group. This interpretation is consistently followed up 
to C.A. 13, and almost so up to C.A. 16. On the plan described 
above for upper age levels, C.A. 16 is figured as 15, i.e., 13 years 
plus ⅔ × (16 − 13), or 15 years. Up to this point, then, there is a direct 
coincidence between test performance and chronological age 
grouping, but not beyond. So an assumption was needed. This 
was that the distribution of adult I.Q.’s would be the same as those 
from ages 5 to 10, and the tests for these upper levels were assigned 
and scaled to bring this about. Clearly then all mental ages for 
persons above this dividing line, and consequently all intelligence 
quotients derived from them, are hypothetical, inferential, and 
consequently open to doubt. A great many workers in applied psy- 
chology have recommended that the intelligence quotient as re- 
vealed by this particular scale and its techniques should not be 
used at all for persons above the age of 16. Another way of look- 
ing at the same difficulty is to say that the age of arrest fixed at 
16 is extremely hypothetical and dubious, and probably a function 
of the scale rather than a phenomenon of mental growth. When 
the intelligence quotient of a person 20 or 30 years old is calculated 
as if he were 16 years old, something quite obviously questionable 
is involved. Symonds (1927), for instance, recommends that in 
dealing with individuals in senior high school and beyond, intelli- 
gence quotients should not be figured. Instead, the test perform- 
ance of any such person should be compared with that of the 
group in terms of which he is to be rated by means of percentile 
rankings, or standard scores, or some similar statistical device. 
And the suggestion has been frequently repeated and put into 
effect. 

(b) A grave question has been raised as to the statistical stability 
of the intelligence quotient. We have seen that the scale is 
set up and the subtests placed in such a way that an I.Q. of 100 
always has the same meaning at all ages. What it always indicates 
is the median test performance of the age group concerned. This 
being given the mental age value of the median chronological age 
of the group, it always yields an intelligence quotient of 100.* 
But this does not in any way prove that I.Q.’s other than 100 
(I.Q.’s of 120 or 80, let us say) will also have the same meaning at 
all age levels. What if an I.Q. of 120 is 1 standard deviation above 
the median at the age of 6, and 2 standard deviations above it at 
the age of 12? Far more people would make this I.Q. score at 6 
than at 12. Or putting the case in other words, an I.Q. of 120 at 6 
would be easier to get and would mean less actual intelligence than 
what would seem to be the identical I.Q. at 12. To be sure, such 
extreme fluctuations as that in our hypothetical illustration do 
not occur. But the intelligence quotient is by no means entirely 
stable at different ages. This is shown from the data in Table 10. 
The standard deviation of the I.Q.’s of the age groups of the 
standardization group on Form L of the scale is 20 at the age of 12, and 
12.5 at the age of 6. So an I.Q. 2 standard deviations below the 
median at 6 would work out at 100 − 2 × 12.5 = 75, whereas at 12 
it would work out at 100 − 2 × 20 = 60. The same relative 
performance, that is, would receive very different ratings. A rough 
graphic representation of what this analysis means is presented 
in Figure 6. It shows what happens to three intelligence quotients 
which are respectively 1, 2, and 4 standard deviations above the 
mean at the age of 2. Reading from Table 10, we see that the 
standard deviation for age 2 is 16.7 on Form L. So the three I.Q.’s 
would work out at 117, 133, and 167 (approximately). Their 
fluctuations are shown by the three curves drawn through 
points plotted 1, 2, and 4 standard deviations above the means for 
successive 1-year age levels, according to the data for Form L in 
Table 10. It is quite clear that there is a considerable fluctuation, 
and that it is more and more marked the more extreme the I.Q. 
Further consideration of this highly important topic must be 
postponed until scores alternative to the intelligence quotient have 
been discussed, as it involves a great deal more than this 
immediate issue. 
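
The arithmetic of this argument is easy to reproduce. The sketch below 
assumes a median of exactly 100 at every age (the text notes it was in 
fact set slightly above 100) and uses only the three Form L standard 
deviations quoted in the discussion. The function name is hypothetical. 

```python
# Form L standard deviations quoted in the text for ages 2, 6 and 12.
FORM_L_SD = {2: 16.7, 6: 12.5, 12: 20.0}


def iq_at(z, age, sd_by_age=FORM_L_SD):
    """I.Q. lying `z` standard deviations from a median of 100 at `age`."""
    return 100 + z * sd_by_age[age]


# The same relative standing, two standard deviations below the median.
print(iq_at(-2, 6), iq_at(-2, 12))

# The three quotients followed in Figure 6, as they stand at age 2.
print([round(iq_at(z, 2)) for z in (1, 2, 4)])
```

Reading the relation the other way, a fixed quotient such as 120 
corresponds to `(120 - 100) / sd` standard deviations, which is 1.6 at 
age 6 but only 1.0 at age 12 on these figures. 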

* With the slight and deliberately introduced qualification already noted, for 
the mean I.Q.’s were brought out slightly over 100 for an excellent reason. 


TABLE 10 

STANDARD DEVIATIONS OF I.Q.’S OF AGE GROUPS ON REVISED 
STANFORD-BINET SCALE 

(Quoted from Terman and Merrill, 1937, Table 7, p. 40) 

Chronological   Numbers in    Standard Deviations   Standard Deviations 
Ages            Age Groups    of I.Q.’s on Form L   of I.Q.’s on Form M 

2               102           16.7                  15.5 
2½              102           20.6                  20.7 
3                99           19.0                  15.7 
3½              103           17.3                  16.3 
4               105           16.9                  15.6 
4½              101           16.2                  15.3 
5               109           14.2                  14.1 
5½              110           14.3                  14.0 
6               203           12.5                  13.2 
7               202           16.2                  15.6 
8               203           15.8                  15.5 
9               204           16.4                  16.7 
10              201           16.5                  15.9 
11              204           18.0                  17.3 
12              202           20.0                  19.5 
13              204           17.9                  17.8 
14              202           19.0                  16.7 
15              107           16.5                  19.3 
16              102           16.5                  17.4 
17              109           14.5                  14.3 
18              101           17.2                  16.6 


FIG. 6. FLUCTUATIONS IN THREE INTELLIGENCE QUOTIENTS AT 
VARIOUS AGES 


(c) A large accumulation of reliability data indicates that 
reliability varies with the size of the I.Q. For chronological ages from 
6 to 13, the error of measurement ranges from 2.8 for low I.Q.’s 
to 5.3 for high I.Q.’s. The corresponding deduced reliability 
coefficients are .97 and .90. Thus general reliability coefficients 
cannot be worked out for this scale, or for any such scales. A 
reasonable expectation would be as follows. Suppose three persons, 
each with a mental age of 100 months, and that the error of 
measurement at this M.A. level is 4.2. Then if one of them had a C.A. 
of 100 months, and thus an I.Q. of 100, this obtained score would 
mean an I.Q. of 100 ± 4.2. If the second were C.A. 80, this would 
give him an obtained I.Q. of 125, which would indicate a true I.Q. 
of 125 ± 5.2. If the third had a C.A. of 125 months, this would give 
him an obtained I.Q. of 80, and would indicate a true I.Q. of 
80 ± 3.4 (McNemar, 1942). 
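
The way a fixed error in mental age turns into a larger or smaller error 
in the quotient can be checked directly. The sketch assumes that the 
error of the I.Q. is simply the mental age error divided by the 
chronological age and multiplied by 100. That reading is an inference 
from the plus-and-minus figures in the example, not a formula given in 
the text, and the function name is hypothetical. 

```python
def iq_with_error(ma_months, ca_months, ma_error_months):
    """Return (obtained I.Q., error band of the I.Q.).

    Both values come from dividing by the chronological age, so a younger
    child with the same mental age gets a higher I.Q. and a wider band.
    """
    iq = 100 * ma_months / ca_months
    iq_error = 100 * ma_error_months / ca_months
    return iq, iq_error


# Three persons with a mental age of 100 months and an error of 4.2 months.
for ca in (100, 80, 125):
    iq, err = iq_with_error(100, ca, 4.2)
    print(f"C.A. {ca}: I.Q. {iq:.0f} +/- {err:.1f}")
```

The sketch makes the direction of the effect plain: the higher the 
quotient obtained from a given mental age, the wider its band of error. 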


B. The vocabulary test. 

This is another major center of controversy. The vocabulary 
test has a very important place in the scale, and appears at many 
age levels. Far-reaching issues are involved in it. The test itself 
consists of a list of 100 stimulus words selected by arbitrary rule 
from a small standard English dictionary containing in all 18,000 
words. The purpose of this method of choosing the words was to 
secure an unbiased sample considered to be of sufficient size and 
at the same time workable in practice. The words are administered 
orally to the subject, who makes an oral response on which he 
is rated. 

It has been criticized on many grounds. (a) The selection of 
words is said to be meager and arbitrary and insufficient to afford 
any true indication of the total vocabulary of the subject (Kent, 
1937). To get an estimate of the person’s total working vocabulary 
is of course the whole purpose. Thus if a subject makes 20 correct 
definitions, he is credited with a total vocabulary of 3600 words 
because the sample of 100 to which he responds is selected from a 
total list of 18,000. So again a 10-word correct response would 
indicate a total vocabulary for the subject of 1800 words. There [...] 
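
The vocabulary estimate described above is a simple proportion, sketched 
here for concreteness. The sample size of 100 and the dictionary size of 
18,000 come from the text; the function name and the range check are 
illustrative additions. 

```python
def estimated_vocabulary(correct, sample_size=100, dictionary_size=18_000):
    """Scale the number of words defined correctly up to the whole list."""
    if not 0 <= correct <= sample_size:
        raise ValueError("correct must lie between 0 and sample_size")
    return round(correct * dictionary_size / sample_size)


print(estimated_vocabulary(20), estimated_vocabulary(10))
```

Each word in the sample stands for 180 words of the dictionary, which is 
why critics regard a single extra success or failure as carrying a great 
deal of weight. 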
measure the intelligence of individual persons, like any other. We 
shall return to this consideration in another connection, as it is one 
aspect of a broad issue in test construction. (b) An age scale of 
this type is criticized as very rigid and as wasteful of time and 
material. Administration from the point of 6 consecutive successes 
to the point of 6 consecutive failures wastes a great deal of time on 
the eliciting of responses that have no bearing on the intelligence 
score. Many of the subtests which are assigned to a single level 
lend themselves to graded use over a wide range of mental ages, 
with appropriate changes in the expected scores. If this had been 
done in the construction of the instrument, it would have been 
much more economical of material, and also much more flexible, for 
as it now stands it must be used in its entirety according to the 
instructions, or not at all. The argument here is in favor of a point 
scale as contrasted with an age scale, i.e., a scale in which each 
subtest contributes to a total numerical score, instead of one where 
certain subtests in a rigid order must be passed in order to establish 
a given mental age. It is interesting to note that in an experimental 
situation the Revised Stanford-Binet scale has been transformed 
into a point scale, and when it was applied to 44 subjects 
ranging from 8 to 18 years of age, it yielded results much like the 
original instrument, with a saving of about one-third of the time 
(Growdon). 

To these criticisms of the practical limitations of the instrument 
McNemar (1942) has replied that the “juggling” of items and 
subtests to equate mental and chronological age did indeed take 
place, and that rigidity and wastefulness are indeed involved. But 
he remarks that even the scores obtained on point scales are very 
often if not usually transformed into mental age values, and that 
these are found very intelligible.* It must be remarked that this 


rebuttal is. by no means wholly satisfactory, and that it hardly ` 


dispels the criticisms of an age scale to say that point scores also 

can be read as age norms. The issue of the objection is that point 
scores are more economical to obtain and that a test set up to 
yield them is more flexible. As to Growdon’s reported transformation
of the Stanford-Binet scale into a point scale, and of the
advantages that accrued, it must be remarked that practically all
his claims could be made of the vocabulary test alone, which can
yield point values for various ages, gives results correlating closely
to those of the whole scale, and takes much less time.

* For one typical instance of the conversion of point scores into mental age
norms see Table 8, which shows the transformation in the case of the
Otis Self-Administering Test of Mental Ability.
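The point-scale principle discussed above, in which every subtest contributes to one numerical total that can then be read as a mental age, can be sketched as follows. This is only an illustration: the norms table, the subtest names, and the use of linear interpolation between tabled values are assumptions made for the example, and none of the figures come from a published test.

```python
from bisect import bisect_right

# Hypothetical norms table: (total point score, mental age in months).
# These values are invented for illustration and come from no published test.
NORMS = [(10, 96), (20, 120), (30, 144), (40, 168)]


def total_points(subtest_points):
    """Point-scale principle: every subtest adds to one numerical total."""
    return sum(subtest_points.values())


def points_to_mental_age(points, norms=NORMS):
    """Read a point total as a mental age (months) by linear interpolation.

    Interpolation is an assumption of this sketch; totals outside the
    table are clamped to the nearest tabled mental age.
    """
    scores = [score for score, _ in norms]
    if points <= scores[0]:
        return float(norms[0][1])
    if points >= scores[-1]:
        return float(norms[-1][1])
    i = bisect_right(scores, points)
    (s0, a0), (s1, a1) = norms[i - 1], norms[i]
    return a0 + (a1 - a0) * (points - s0) / (s1 - s0)


# Hypothetical subject: points earned on three subtests.
subject = {"vocabulary": 12, "digit repetition": 8, "analogies": 5}
points = total_points(subject)
mental_age = points_to_mental_age(points)
print(points, mental_age)
```

Because the total is a simple sum, a subtest can be dropped or added without disturbing the rest of the procedure, provided norms exist for the reduced total. That is the flexibility the critics of the age scale have in mind.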


D. Composite character of the scale. 


The composite character of the scale has been assailed (Wells, 
1938; Stoddard, 1943). Often the expressions which the critics 
allow themselves to use in this connection are hardly less than 
abusive. It has been called “a hodgepodge,” a “motley collection 
of tasks,” and so on. And it has been likened to a refurbished 
model T Ford, this being a reference to its composite character 
which Terman took over from Binet and embodied in both his 
revisions. It may very well be that a composite instrument containing
a large variety of types of items and subtests is the logical
translation into practice of the vague and inclusive yet significant
concept of general intelligence.

Still the question of what this “composite” actually measures is
an entirely fair one. McNemar (1942), using the techniques of
factor analysis, has shown that in spite of the wide variety of subtests
and items, the scale taken as a whole seems to measure in the
main one thing; to wit, a “general factor.” But it is legitimate to
inquire further as to the nature and meaning of this “one thing.”
Before turning to this, a word is in order here as to the kind of
substitute which some of the critics would presumably recommend.
What many of them would have in mind is an array of tests dealing
with much more sharply defined mental processes and functions,
such as verbal ability, numerical ability, inductive reasoning,
deductive reasoning, and the like. Such clearly defined entities
are among the outcomes of a certain type of factor analysis.
McNemar answers very truly that such sharply defined “purified”
tests are likely to be much less valuable clinically than composites
which elicit performance in many different situations and in terms
of responses to a varied array of items. However, since tests of this
latter kind have only just begun coming into existence, there is
not experience enough as yet to determine their clinical possibilities,
particularly when a sufficiently varied battery of them is
utilized. And as we have seen, the Stanford-Binet scale has not
recommended itself very highly for clinical diagnosis and the indication
of differences in kind and quality within the total pattern
of intelligence. A further reply to the alternative suggested by
some of the critics is that such “purified” tests as have been made
and tried out have not yet proved themselves distinctively superior
in practice to the best of our existing instruments, certainly including
the Stanford-Binet scale. We shall return later on to this
point.


E. Level of obtained I.Q.’s.


One criticism that has been made of the new Stanford-Binet 
scale is that it yields I.Q.’s that are unduly high. Stoddard (1943) 
lays considerable stress on this point. He quotes Mitchell (q.v.) 
as speaking of the “surprise” felt by clinical workers and psychi- 
atrists in getting I.Q.’s of 130 to 140 when accustomed to com- 
paratively low ones by the use of the first Stanford-Binet scale. 
Mitchell found that when a group of 155 psychopathic patients 
were tested with the first scale their mean I.Q. was 91, and that 
when they were tested with the new scale it shifted to 105. In 
the same way the mean I.Q.’s of 155 senior medical students 
shifted from 110 to 131 when the new scale was substituted for
the old. However, one finds nothing particularly unreasonable in 
such data. The new scale is certainly a better test for the upper age 
levels than the old one, although still far from perfect; and the 
reason for the improvement is of course the introduction of four 
groups of adult tests instead of only two. Moreover, it has been 
recognized that the first Stanford Revision was standardized at too 
high a level, due to the selection of an above-average standardiza- 
tion group. The mere fact that the second scale yields prevailingly 
higher I.Q.’s than the first is certainly nothing against it, and may 
indeed be in its favor. As to the increase in the obtained I.Q.’s of 
a psychopathic group of subjects, it must be remembered that 
Piotrowski showed that the first Stanford Revision consistently 
tended to underestimate the mentality of disturbed persons. 


F. What the scale measures. 


The great question towards which almost all specific criticisms 
ultimately lead is: What does the scale measure? The direct an- 
swer seems to be that it measures whatever the mental factors are 
that make people succeed in school, particularly in the academic 
subjects. This is quite probably the real nature of the general
factor discovered by McNemar’s factor analysis. A great deal that
we know about the scale, and a great deal of what has been said
about it above converges in this conclusion. It has marked limi- 
tations as a clinical instrument and as an aid to psychiatric 
diagnosis and prediction. It is capable of an unequivocal demonstration
of mental advance to what has been just about the
school-leaving age and not beyond it. Even a cursory scrutiny
of the items embodied in it shows its marked reliance on 
what has been learned, including arithmetic, language, informa- 
tion, and the like, and on tasks quite similar to those which occur 
constantly in school studies. Even the second revision is a prevail- 
ingly verbal test in its upper levels. It is worked out in terms of 
a set of age norms which have a remarkable analogy to the steps 
of the “educational ladder,” and which probably put a good deal 
more significance on one year of development than would be the
case if one were thinking simply of mental growth in and of itself,
for a year is after all a calendar unit and has no intrinsic developmental
significance however important it may be in pedagogical
organization. Its validation has been prevailingly, though to be
sure not exclusively, on the criterion of success in school. And
lastly, the vocabulary test would certainly seem to be related
prima facie to educational achievement and background. It is
noteworthy that the defense of the vocabulary test put forward by
Terman, McNemar, and others associated with them has been its
striking agreement with the scale itself. That is to say, it is not
regarded as bringing in new confirmatory evidence of intelligence
and ability, but rather defended as agreeing with what the whole
scale can show somewhat more accurately. Also it must be remembered
that the original practical problem faced by Binet, of
whose work the two Stanford Revisions are the most direct continuation,
was to ascertain intelligence as expressing itself in
school situations and school accomplishment.

[...] not defensible to dismiss it as merely an
“academic aptitude test,” as certain critics have done. This is
merely substituting one word [...]
expression in the place of a laudatory [...]
extremely important [...]
and the test is intended for young people. [...]
decisively as they do in any other aspect of [...] lives.

The broad organic concept of general intelligence implies [...]
deal more. This is particularly true [...] admitted
and undeniable limitations of the academic milieu, with
[...] emphasis [...] routine, [...]
or at least undervaluation of originality, [...]
and external standards. But to deny that a test which [...]
aptitude for school success in no way reveals general intelligence is
fantastic, even admitting the organic limitations of the school. As
to whether alleged better tests—narrowed, purified, centered more
closely and explicitly on sharper concepts—will reveal mental 
processes better, only time can tell. 


MODIFICATION OF BINET’S PRACTICES: POINT SCALES


Even before the first Stanford Revision appeared, certain quite
different developments, based on the work of Binet, were in the
making. The changes contemplated were the dropping of the age-wise
organization, and the use of the same subtest in many instances
at different age levels with different scoring standards.
The result would be a point scale rather than an age scale, i.e., a
scale yielding a point score or scores rather than a mental age,
although the point score might be transformed into an equivalent
M.A. The Point Scale for the Measurement of Intelligence by
Yerkes and Bridges (1916; revised 1923) is often considered the
pioneer instrument of this type. It seems better, however, to select
as an illustration a scale still in current use.


1. Herring Revision of the Binet-Simon Tests * 


This, like some other point scales, is essentially a modification
of the ideas and practices of Binet, rather than a radical departure
from them. The synoptic outline is shown in Figure 7. A comparison
with the Stanford Revisions will be found instructive, for it
embodies many of the changes which are being recommended at
the present day, though in this particular early instance they
were handled somewhat clumsily.

* References: Herring, 1922, 1923.


GROUP A

 1. Tell me what you see in this picture. (4 pictures presented in
    series)
 2. In the first row of numbers tell me what two numbers should
    come next — — (here and here). Go ahead. (8 such rows
    for number completion)
 3. Read this to yourself. Then begin at the beginning and tell me
    everything you have read. (a passage with 13 “memories”)
 4. I am going to say some numbers. When I am through, say the
    numbers backwards. (digit groups ranging from 2 to 9
    numbers)

GROUP B

 5. Showing knees, fingers, ear, foot.
 6. Repetition of 6- and 7-syllable sentences.
 7. Pointing out the larger figure in each of 3 pairs of figures.
 8. Aesthetic discrimination, 4 pairs of faces.
 9. Naming black, gray, white.
10. Giving the solution to 6 problem situations.
11. Reproduction of thought. 17 “memories.”
12. Definition of 7 abstract words.
13. Reproduction of thought. 12 “memories,” very difficult reading.

GROUP C

14. Give solutions to 5 problem situations.
15. Detect absurdities in 8 statements.
16. Building 4 sentences of 3 words each.
17. Giving rhyme for 4 words.
18. Picking out similarities in 6 groups of 3 things each.
19. Interpretation of 5 proverbs.
20. Reproduction of thought. 13 “memories,” rather difficult
    reading.
21. Read 3 scrambled sentences.
22. Solve 3 arithmetical problems.

GROUP D

23. Repeat 4 sentences of 10 to 13 syllables.
24. Directions test.
25. Directions test.
26. Similarities between 4 groups of 3 things each.
27. Generalize from 4 separate but related statements.
28. Comprehension of 2 verse passages.
29. Sentence completion.
30. Problem reading and solving.

GROUP E

31. Name gs familiar objects.
32. Comparison of forms.
33. Perform 3 commands.
34. Diagram problem solving.
35. Forward repetition of digits, 2-10 in various series.
36. Repetition of 3 sentences of 19 to 24 syllables.
37. Detection of proportional relationships from material read.
38. Code writing.


FIG. 7. HERRING REVISION OF THE BINET-SIMON SCALE.
SYNOPTIC OUTLINE

One notices a considerable similarity in the material, for many 
of the subtests are derived or revised from the original Binet items. 
But there is a fundamental change in their arrangement and treat- 
ment. As will be seen, they are subdivided into five groups. When 
the test is given to a person, the groups are treated as cumulative, 
i.e., each one consists of the preceding plus the new material. At 
the end of each group there are instructions to omit various tests 
in the new material not yet taken up if it is evident that the child 
will pass or fail with it. This shortens the total time needed for 
testing, but that time is still very long, and the instrument is 
cumbersome. The material, moreover, is highly verbal, this corre- 
sponding to Herring’s conception of intelligence as the power of 
abstract thinking. It is not very well standardized. The instrument 
is mentioned here as an instance of a new direction of test con- 
struction that was to bear fruit first in the modification of Binet’s 
practices by Kuhlmann and then in their radical revision by 
Wechsler. 

Before passing on, however, it is desirable to set forth the ad- 
vantages claimed for point scales in general in contrast to age 
scales. These are presented in a series of ten contrasting state- 
ments in Figure 8, which is derived from the work of Yerkes and 
Foster. The case on behalf of the point scale and against the age 
scale is no doubt overstated, and some of the claims made for the 
former cannot be substantiated by reference to existing point 
scales. But certain advantages do seem clear. A point scale, if 
properly constructed, can be flexible in the sense that it need not 
be used in its entirety, since certain subtests can be picked out at 
the discretion of the examiner and still yield significant scores in 
which the meaning is clear from the standardization. The use of 
the same subtests at different ages economizes material and time. 
It lends itself very well to statistical analysis. And there is much 
force in the claim that point scale subtests can be selected with 
reference to a given psychological function rather than in terms of
the relationship of success to age. Many clinicians and applied
psychologists have expressed regret that Terman did not convert
the Stanford-Binet scale into a point scale in the second revision.


 1. Age scale:   Tests organized by years or other age-units.
    Point scale: Single homogeneous graded scale.

 2. Age scale:   Tests and items selected by relationship of
                 success to age.
    Point scale: Tests selected in terms of the function to be
                 measured.

 3. Age scale:   Varied, unrelated ungraded tests in a composite.
    Point scale: Each test so graded as to be available over a
                 wide range of ages.

 4. Age scale:   Internally standardized and inflexible.
    Point scale: Standardized against external criteria and
                 flexible.

 5. Age scale:   All-or-none ratings of subject’s responses.
    Point scale: More-or-less ratings of subject’s responses.

 6. Age scale:   Qualitative.
    Point scale: Quantitative.

 7. Age scale:   Measurements not fully amenable to statistical
                 treatment.
    Point scale: Measurements wholly amenable to statistical
                 treatment.

 8. Age scale:   Tests weighted equally.
    Point scale: Tests weighted unequally.

 9. Age scale:   Implicit assumption that of newly appearing or
                 emerging functions.
    Point scale: Implicit assumption that of continuously
                 developing functions.

10. Age scale:   Measurements for different ages relatively
                 incommensurable.
    Point scale: Measurements for different ages comparable and
                 commensurable.


FIG. 8. CONTRASTING CHARACTERISTICS OF AGE SCALES AND POINT SCALES
(Adapted from Yerkes and Foster, q.v.)


MODIFICATIONS OF BINET’S PRACTICES: THE WORK
OF KUHLMANN


Kuhlmann’s name is associated with the best work in elaborating 
and modifying the practices and ideas of Binet without going to 
the length of radical revision.


1. Kuhlmann-Binet Scale *

* Reference: Kuhlmann, 1922.

Before coming to his chief present-day contribution, mention
must be made of his two important revisions of the Binet scale,
published in 1912 and 1922. The 1912 revision adhered fairly
closely to the original; but in the 1922 revision, which was the
result of seven years’ work, considerable and important departures 
occurred, among them the elimination of 19 of the original sub- 
tests, an increase in the total number of subtests to 129 with 8 in 
each age group above 2 years, extension downward to the age of 
3 months, and credit for speed as well as accuracy. The subtests 
for very early ages included carrying an object to the mouth, and 
binocular coordination determined by fixation on a moving object 
(3 months); opposing the thumb in grasping, and reaching for
objects seen (6 months); initiation of speech sounds, such as
mama, baba, dada (1 year); imitation of simple movements made
by the experimenter (2 years). Arthur (1939) in an investigation 
using 200 subjects found that the Kuhlmann-Binet agreed more 
closely with Stanford-Binet results than did a Stanford-Binet 
retest. This instrument is mentioned because it has great practical 
and theoretical importance as a landmark in psychometric prac- 
tice. But in the next of his contributions to be discussed, Kuhl- 
mann, though using many of the ideas embodied in it, definitely 
carried revision a step further. 


2. Tests of Mental Development (Kuhlmann) + 


The Tests of Mental Development embodies many ideas and 
practices that are of great interest. A synopsis of representative 
selections appears in Figure 9. Reference to it will help to make 
the description and analysis easier to follow. 


1. Characteristics 


A. General characteristics. Kuhlmann’s Tests of Mental Devel- 
opment might appropriately be called a scale, although he does not 
use the word. The instrument has an age range extending from 3 
months upwards.

It consists of 89 subtests and 19 supplementary subtests, some 
from the original material of Binet, some from his own revisions 
of the Binet scale, and some new material of his own. 

In the preliminary work of assembling and trying out the sub- 
tests, 121 tests were administered to about 15,000 persons from 3 
months to 60 years old. The choice of suitable tests for the battery 
was made on the following criteria. (a) Preference was given to
those which showed large increases in raw scores in the case of
tests with several elements which were to appear at several levels.
Several of these are shown in Figure 9. (b) Preference was given
to those which yielded large increases in the percentages of subjects
who passed from age to age in the case of tests which were
either passed or failed. Several of these also are shown in Figure 9.
(c) Tests were selected in order to give a wide range of raw scores
in any single age group. This is contrary to usual standards in test
construction. Ordinarily it is thought desirable to have a minimum
variability within an age group, but Kuhlmann disagrees with
this criterion. His view is that tests which show a wide variability
are desirable because they will show qualitative differences more
clearly and thus provide a better diagnostic instrument. (d) Preference
was given to those tests which showed high correlations
with total scores on the entire battery. (e) It was thought desirable
to include tests with a wide variety of make-up. (f) Preference
was given to tests which were as free as possible from the
effects of coaching, practice, and variable training (Kuhlmann,
1939).

+ Reference: Kuhlmann, 1939.
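Two of the selection criteria listed above lend themselves to a simple numerical check: criterion (b), the rise in the percentage passing from age to age, and criterion (d), the correlation of a test with total score on the battery. The sketch below shows one way such checks might be computed. The data are invented for the example, and the choice of the Pearson coefficient is an assumption, since the text does not say which correlation measure Kuhlmann used.

```python
from math import sqrt


def percent_passing(results_by_age):
    """Percent of subjects passing a pass/fail test at each age.

    results_by_age maps an age to a list of 1 (pass) / 0 (fail) results.
    """
    return {age: 100.0 * sum(r) / len(r) for age, r in results_by_age.items()}


def pearson(xs, ys):
    """Product-moment correlation between two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical pass/fail records for one test at ages 4, 5 and 6.
results_by_age = {4: [1, 0, 0, 0], 5: [1, 1, 0, 0], 6: [1, 1, 1, 0]}
passing = percent_passing(results_by_age)

# Hypothetical scores of five subjects on one test and on the whole battery.
item_scores = [1, 2, 3, 4, 5]
total_scores = [10, 14, 15, 19, 22]
r = pearson(item_scores, total_scores)
print(passing, round(r, 3))
```

A test would be favored under criterion (b) when the percentages climb steeply from one age to the next, and under criterion (d) when the coefficient is high.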

B. Standardization and scaling. A standardization group of
about 3,000 was used, yielding about 106 subjects for each age
group in the preschool range, and about 140 each year for school
children. The tests, however, were not organized into age groups
or levels as with the Stanford Revisions, but into a scale of subtests
of increasing difficulty.
The basis of the scaling is novel and distinctive. It depends upon
the alleged curve of mental growth developed by Heinis (q.v.).
Heinis derived what he considered a standard curve of mental
growth from various test data on populations ranging in age from
2 to 18 years. He believed it to represent the true course of human
mental growth. And he reduced the growth curve to a mathematical
formula. For instance, his formula indicates 16 degrees of
mental development between the ages of 16 and 20, 15 units or
degrees between 20 and 30, and 3 between 30 and 40. The number
of degrees of mental growth is much larger between earlier pairs of
ages. This, as will be seen, corresponds in a general way to Terman’s
view that mental age steps decrease in real value as age
advances. But Heinis went much further than this. He claimed
that his curve and formula reveal the real course of normal
development. Kuhlmann has adopted this view, and the tests in
his instrument were arranged to yield passing scores at the
points of true development as shown on the Heinis curve. The
mental age values which are assigned to the subtests, as is illustrated
once more in Figure 9, were worked out by placing the
subtest in the age group where it is passed by 50% of the subjects.
Thus Kuhlmann rejected the rather elaborate trial-and-error
manipulation and adjustment of subtests which was used in
assigning them to the appropriate ages in the Stanford-Binet scale.


     0-4 (21) SITTING WITH SUPPORT
         Child is placed in chair supported by back.
         Passed at 21 if he sits up for thirty seconds.*

     1-3 (93) NAMING OBJECTS SHOWN
         Child is shown five objects and asked to name them.
         M.U. — 93 — [illegible]
         R    — [illegible]

 25. 2-6 (135) NAMING OBJECTS FROM MEMORY
         Child is shown two objects and asked to name them, then
         told to shut his eyes and name them from memory.
         M.U. — 135 — 153
         R    —   1 —   2

 35. 3-7 (177) RECOGNITION OF MISSING PARTS IN PICTURES
         Child is shown series of pictures with missing parts, and
         asked to name them.
         M.U. — 177 — 210
         R    —   2 —   4

 53. 5-1 (228) REPETITION OF NUMERALS
         Repetition of digits given orally, six sets in all.
         M.U. — 228 — 252
         R    —   2 —   3

 61. 6-2 (258) SIZE OF VOCABULARY
         Telling meanings of twenty-five words.
         M.U. — 258 — 297 — [illegible]
         R    — [illegible] — 8 — 12

 72. 8-8 (312) FINDING WORD AMONG SIX THAT COMPLETES AN ANALOGY
         A pair of words which establish a relationship, task being
         to choose among six given words the one that establishes
         same relationship to a third word, e.g. finger is to hand as
         toe is to —.
         M.U. — 312 — 345
         R/T  — [illegible] — .079 — .105

 89. 12-6 (363) DRAWING UPRIGHT FORMS IN INVERTED POSITION
         Series of designs presented on cards, with instructions to
         draw them as they would appear in inverted position.
         M.U. — 363 — 396 — 429 — 462 — 495 — 528
         R/T  — .012 — .032 — .052 — .072 — .092 — .112


FIG. 9. SAMPLE SUBTESTS FROM TESTS OF MENTAL DEVELOPMENT.
(SCALE CONTAINS 89 SUBTESTS IN ALL)
(Kuhlmann, 1939)
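The rule of placing a subtest at the age where it is passed by 50% of the subjects can be sketched numerically. The example below is an illustration only: the percentages are invented, ages are expressed in months, and linear interpolation between adjacent age groups is an assumption of the sketch, not a procedure attributed to Kuhlmann.

```python
def age_at_half_passing(percent_by_age):
    """Estimate the age (in months) at which 50% of subjects pass a test.

    percent_by_age maps age in months to percent passing. Linear
    interpolation between neighboring ages is an assumption of this
    sketch. Returns None if the percentages never cross 50.
    """
    ages = sorted(percent_by_age)
    for lo, hi in zip(ages, ages[1:]):
        p_lo, p_hi = percent_by_age[lo], percent_by_age[hi]
        if p_lo <= 50.0 <= p_hi and p_hi > p_lo:
            return lo + (hi - lo) * (50.0 - p_lo) / (p_hi - p_lo)
    return None


# Hypothetical percentages passing one subtest at ages 5, 6, 7 and 8 years.
percent_by_age = {60: 30.0, 72: 45.0, 84: 65.0, 96: 85.0}
placement = age_at_half_passing(percent_by_age)
print(placement)
```

With these invented figures the subtest would be placed at 75 months, i.e., an age level of 6-3 in the years-and-months notation used in the scale.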

C. Scoring. The scoring is unique and somewhat intricate—too
much so, in fact, for a full account to be repaying here—although
the claim is made that when a person becomes accustomed to it he
finds it feasible enough. For the easier tests, up to number 63, the
score is simply the number right. Above that level a combination
of speed and accuracy is used, the tests being timed to about 2
minutes and the speed scored by dividing the time the subject
takes to complete the test by the number of seconds. In order to
penalize the inaccurate subject the multiplication of speed by
accuracy is resorted to. Kuhlmann’s general reason for his unusual
emphasis upon speed and timing is that while in itself it is not
very significant, it becomes so in connection with exacting problematic
tasks. The obtained raw scores resulting from these scoring
procedures are converted in terms of the Heinis curve into mental
units, symbolized as M.U., which show the developmental value
and meaning of the subject’s test performance. Since on the basis
of the Heinis curve we know what the developmental status or
M.U. score or average for any given chronological age is, and since
the test gives the subject’s rating in terms of his actual M.U. score,
it is possible to know his status with reference to average or expected
mental development. So there is derived a score which
Kuhlmann calls the Percent of Average, or P.A., which expresses
the percentage of average or expected mental development for his
age which any given person manifests. Kuhlmann regards this
score as superior to and more meaningful than the I.Q. But as will
be clear from an examination of Figure 9, and from the above explanation,
the instrument can also yield mental ages and intelligence
quotients.

* The symbols which appear in connection with these tests require
explanation [...] to give the reader some concrete idea of
Kuhlmann’s method of scoring and of the general layout of the scale. Beginning
with the top line, the first numeral on the left is the serial number of the test.
Next appears the age level in years and months at which the test scores [...].
Next appears the equivalent growth curve level in mental units. Below the
description are the scoring indications. Sometimes the score is the number of
right responses (R). Sometimes it is the number of right responses divided by
the time taken (R/T). The rating in mental units (M.U.) for various score
values appears immediately above the scores.
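The chain from raw score to mental units to Percent of Average can be sketched with the one complete scoring row that survives in Figure 9 (test 89, where R/T values from .012 to .112 correspond to 363 to 528 M.U.). The sketch rests on several assumptions that should be kept in mind: time is taken in seconds; a score is credited with the highest tabled value it reaches, without interpolation; and the expected M.U. for the subject's age is simply supplied by the caller, since the Heinis formula is not given here. How Kuhlmann combines ratings from several subtests into one M.U. score is likewise not described, so the example treats a single subtest.

```python
from bisect import bisect_right

# R/T thresholds and M.U. ratings for test 89, as printed in Figure 9.
TEST_89 = [(0.012, 363), (0.032, 396), (0.052, 429),
           (0.072, 462), (0.092, 495), (0.112, 528)]


def rt_score(right, seconds):
    """Number of right responses divided by the time taken (R/T)."""
    return right / seconds


def mental_units(score, table=TEST_89):
    """Highest tabled M.U. rating whose threshold the score reaches.

    Step-wise lookup is an assumption of this sketch. Returns None when
    the score falls below the lowest tabled value.
    """
    thresholds = [threshold for threshold, _ in table]
    i = bisect_right(thresholds, score)
    return table[i - 1][1] if i else None


def percent_of_average(mu, expected_mu):
    """P.A.: obtained M.U. as a percentage of the M.U. expected for the age."""
    return 100.0 * mu / expected_mu


# Hypothetical subject aged 12-6 (expected level 363 M.U. per Figure 9)
# who gets 6 designs right in 120 seconds.
score = rt_score(6, 120)
mu = mental_units(score)
pa = percent_of_average(mu, expected_mu=363)
print(score, mu, round(pa, 1))
```

A P.A. above 100 indicates development ahead of the average for the subject's age, and one below 100 indicates development behind it.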


2. Critique and Appraisal 


A. Correspondence with Binet. The essential point of resem- 
blance between the Tests of Mental Development and the original 
Binet scale and the two Stanford Revisions is that all of them are 
composite unanalytic instruments containing a multiplicity of 
items. Indeed one might say that the basic operative conception 
of general intelligence embodied in the instrument is even more 
loosely defined than with Terman and Binet. Kuhlmann is evi- 
dently committed to the idea of a diverse over-all survey of the 
mentality of a subject. In fact he expressly repudiates the notion 
of asking exactly what it is that the test measures, or what precise 
psychological function is involved. In this respect the instrument 
now under consideration does not realize one of the values attrib- 
uted to point scales by Yerkes and Foster and summarized in 
Figure 8. The contention there appears that point scale subtests 
are or can be constructed in terms of the “function to be meas- 
ured.” Kuhlmann, however, does not regard this as a positive 
value. He insists that the true significance of the work of Binet 
is the high relationship of his tests to “practical facts,” such as 
school achievement, occupational level and success, adjustment to 
social standards, and the like (v. Kuhlmann, 1939-40). It is, of 
course, this point of view which explains, and so far as it is valid 
justifies, the composite, inclusive, unanalytic character of the Tests 
of Mental Development. 

B. Divergence from Binet. While Kuhlmann, in the Tests of 
Mental Development, retains the idea of age levels and of a development
of mentality related to chronological age which is characteristic
of Binet, he treats it differently in terms of the organization
of the instrument and in the method of interpreting test
performance. The mental age values of the subtests are given, but 
they are not set up expressly as an age scale. Also the use of the 
Heinis growth curve is a very distinctive feature. Binet and the 
workers responsible for the two Stanford Revisions refrained 
from any particular claims as to the course of mental develop- 
ment. Kuhlmann’s scheme of scoring and interpretation depends 
upon a highly specific claim. The idea underlying the M.U. scores,
and of the P.A. or Percent of Average, is that normal mental
growth is ascertainable and in fact that we know just what it is.
Without opening up the whole question as to the nature of mental
growth, which must be considered later on, it is clear that considerable
doubts suggest themselves here.

C. Reliability. Kuhlmann’s distinctive views on the subject of
the reliability of tests must be noted here, for they affect both the
organization and the interpretation of the Tests of Mental Development.
He has from time to time argued that high reliability
is by no means so desirable as is ordinarily supposed. When a
measuring instrument is made very reliable, it is not affected by
variable errors to any extreme degree. But for Kuhlmann many
of the variable influences which can affect test performance are
not sources of “error” at all, but perfectly legitimate factors that
ought to be recognized, and accordingly should be reflected in the
test score. Among these might be the mood, the attitude, the physical
condition of the subject, his fatigue, or his boredom. Granting
that these influences do affect mentality, one can understand why
Kuhlmann feels that an instrument designed to override and
ignore them as far as possible points directly towards falsification.
This is why he declines to enter into the question of the
reliability of the Tests of Mental Development.

D. Stability of scores. Kuhlmann makes the claim that his characteristic
measure, the Percent of Average, which is his name for what
Heinis called the Personal Constant, is decidedly more stable than
the Stanford-Binet I.Q. That is, it retains its meaning in terms of
distance from the mean with less fluctuation at different age levels.
The probable errors for the Percent of Average at different age
levels are shown in Table 11. At first inspection, at least, it would
seem that these variations are much [...] those reported for
the standard deviations of the Stanford-Binet Intelligence Quotients
in Table 10.
E. Practical values and limitations. The [...]
leads immediately to a consideration of the practical [...]
limitations of the instrument, for Kuhlmann’s [...]
with accuracy and difficulty [...]
[...] too, is why [...]
[...] integrated instrument
rather than one built logically on a well-defined concept.


TABLE 11

PROBABLE ERRORS * OF PERCENTS OF AVERAGE
AT VARIOUS AGE LEVELS
(Adapted from McNemar, 1942, Table 51, p. 162)

[table values illegible]

* The probable error is .6745 times the standard deviation. This must be borne
in mind in comparing the above with standard deviations of intelligence quotients.
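The relation stated in the footnote, probable error = .6745 times the standard deviation, means that probable errors must be rescaled before they are set beside standard deviations. A minimal sketch of the conversion in both directions follows. The numerical values used are hypothetical and are not taken from Table 10 or Table 11.

```python
PE_FACTOR = 0.6745  # probable error = PE_FACTOR * standard deviation


def sd_to_pe(sd):
    """Convert a standard deviation to the corresponding probable error."""
    return sd * PE_FACTOR


def pe_to_sd(pe):
    """Convert a probable error back to a standard deviation."""
    return pe / PE_FACTOR


# Hypothetical values, for illustration only.
pe_from_sd = sd_to_pe(16.0)   # a standard deviation of 16 points
sd_from_pe = pe_to_sd(8.0)    # a probable error of 8 points
print(round(pe_from_sd, 2), round(sd_from_pe, 2))
```

Since the probable error is always the smaller figure, a table of probable errors will look more stable than a table of standard deviations unless one of the two is first converted.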


FAR-REACHING REVISION OF BINET’S PRACTICES


1. Wechsler-Bellevue Intelligence Scale * 
This is the outstanding example of an instrument representing a 
far-reaching revision of the practices and ideas of Binet, yet not 


unrelated to them. 


1. Characteristics 

A. General characteristics. This scale is an instrument for individual
use. It is applicable to ages from 10 to 60 and upwards. It
is particularly designed for adults and is regarded as the most generally
satisfactory instrument for the measurement of adult intelligence.

A synoptic outline is presented in Figure 10. As will be seen, it
consists of 10 units or subtests and one alternative. Each test unit
is applicable to a wide age range, with differential scoring. Thus it
illustrates one of the distinctive advantages claimed for point
scales by Yerkes and Foster.

The test units or subtests are related and combined to form four
separate but interrelated scales of intelligence. First, there is the
main Individual Examination for ages 10 to 60, consisting of all
the test units. This can be reduced to 7 instead of 10 tests, if so
doing seems desirable in the light of the adjustment and type of
the subject. Second, there is the Adolescent Scale for ages 10 to
16, using the same test units, but with a different standardization.
Third, there is the Performance Scale consisting of subtests 6 to
10 inclusive. Fourth, there is the Verbal Scale, consisting of tests
1 to 5 inclusive, with the vocabulary test as an alternative.
Wechsler (1944) has presented an evaluation of the subtests in
the light of their relationship to performance on the whole scale,
and of their general psychological character. (a) As to the general
information test (no. 1), he points out that any information test
depends for its value on the use of information items that are
common knowledge. He finds that his test successfully samples general
information, but that it is not so successful for those with special
opportunities and training. Its correlation with performance on
the whole scale is .66 for ages 20 to 34, and .68 for ages 35 to 49.
(b) The arithmetical reasoning test (no. 3), like all tests of this
type, was easy to make and to standardize. Its correlation with


* Reference: Wechsler, 1944.
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——_ 


1. GENERAL INFORMATION
Twenty-five information questions to be answered Right or
Wrong, in order of difficulty.

2. GENERAL COMPREHENSION
Ten questions: what to do? why thus and so?

3. ARITHMETICAL REASONING
Ten timed verbal arithmetic problems.

4. DIGIT REPETITION
Fourteen sets of digits, ranging from three to nine per set, to
be repeated forward. Fourteen sets, from three to eight, to be
repeated backward.

5. SIMILARITIES
Twelve pairs of words, task being to indicate in what way
they are similar.

6. PICTURE COMPLETION
Fifteen cards showing pictures each with a part missing, task
being to indicate what is missing.

7. PICTURE ARRANGEMENT
Six sets of cards, each set “telling a story” in sequence, e.g.
the catching of a fish, the task being to arrange each set in
proper order.

8. OBJECT ASSEMBLY
Three sets of cutouts, which assemble into three objects
(manikin, profile, hand), task being to put them together.

9. BLOCK DESIGN
Sixteen cubes in colors, nine designs on cards, task being to
reproduce the given designs by means of the blocks.

10. DIGIT SYMBOL
Nine divided boxes as shown below, giving digits each with
corresponding symbol, task being to write correct
symbols under sixty-seven numbers.

11. VOCABULARY
Subject to tell meaning of forty-two words.


FIG. 10. WECHSLER-BELLEVUE INTELLIGENCE SCALE.
SYNOPTIC OUTLINE
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the whole scale is .63 for ages 20 to 34, and .67 for ages 35 to 49. 
(c) The digit memory test (no. 4) is again easy to handle. But it is 
a poor test. It is retained because it is effective for lower age 
levels, and because it often diagnoses mental defect. Its correlation 
with the whole scale is reported at .51. (d) The similarities test 
(no. 5) is one of the best included. It correlates with the whole 
scale .73. (e) The picture arrangement test (no. 7) depends for 
its effectiveness on the scenes depicted. Thus a picture of a bird 
building its nest might have different discriminatory and other 
values from one of a policeman chasing an automobile. The 
attempt was made to use familiar scenes for the items of this test. 
However, it does not discriminate well in terms of age levels. Its 
correlation with the whole scale is .51. (f) The picture completion 
test (no. 6) meets all criteria quite well. It correlates with the 
whole scale .61. (g) The digit-symbol test (no. 10) correlates with
the whole scale .673 for ages 20 to 34 and .697 for ages 35 to 49.
(h) The block design test (no. 9) is a good test, and is effective in
picking out those low in intelligence. It correlates with the whole
scale .73. (i) In constructing the object assembly test (no. 8), a
difficulty was to secure familiar configurations to be assembled. It
has been retained because it makes significant additions to the
total score, and because it gives opportunity for the examiner to
make a qualitative analysis of the subject’s mental processes. Its
correlations with the whole scale are .41 for ages 20 to 34, and .51
for ages 35 to 49. These correlations are low because there were
small groups of subjects in the standardization group. (j) The
vocabulary test proves to be an excellent measure of school
achievement and of general intelligence. It correlates .85 with the
scale as a whole.

A second form was published in 1946, and a special army adap- 
tation, known as Form B, was prepared during the war. 

B. Scaling and standardization. The norms are based upon
standardization groups [...] from about twice as many persons in
proportion to the occupational distribution of the white
population [...]. There were from 50 to 175 subjects for each age
group from 7 to 70. Most of the adult age levels are at five-year
intervals.
C. Scoring and administration. The scoring scheme of the test
introduces another unique feature. Since it extends upwards to the 
chronological age group from 60 to 70, it does not yield intelligible 
mental ages. Wechsler in fact is very critical of the concept of 
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mental age (1944). He points out that a mental age has about it 
nothing sacrosanct or mysterious, and that it is simply a score like 
any other score. Moreover, as he says, it is just as anomalous to 
talk about a person 60 years old as having a mental age of 16 as 
it would be to say that a child of 10 had a mental age of 60. For 
to repeat, a mental age is simply a score. Thus a mental age of 122 
months on the Stanford-Binet scale presumably means 61 items 
credited. 

With this in mind Wechsler disregards the mental age concept 
entirely, and converts the raw scores yielded by the scale into his 
own units of measurement directly. These units he calls intelli- 
gence quotients, but since there is no mental age determination, 
they cannot be obtained by dividing it by chronological age in the 
usual way. In fact, his intelligence quotients are not in reality 
quotients at all. The I.Q. is figured always with reference to the 
distribution of scores of the age group. The I.Q. value of 90 is set
for the score which is 1 probable error below the mean score of
the age group. The I.Q. value of 110 is set for the score which is 1
probable error above the mean score of the age group.

The probable error (or P.E.), it should perhaps be explained, is 
a measure of dispersion which is .6745 times the standard devi- 
ation. When a distribution is normal, that part of it which lies 
between 1 P.E. above and 1 P.E. below the mean will include 
50% of the cases. This is the reason for Wechsler’s choice of the 
P.E. in determining the values of his derived scores. By common
acceptance the I.Q. range from 90 to 110 includes those of
“normal” mentality or “average” mentality, which is thought to be
about 50% of the population. Thus by defining I.Q.’s of 90 and
110 as 1 P.E. below the mean and 1 P.E. above the mean, Wechsler
sets them at levels which may be expected to include 50% of the
population. The general logic of this procedure is analogous to 
that in working out standard scores as explained above. Standard 
scores, of course, are based upon the standard deviation as a unit 
in terms of which to measure the distance of any obtained raw 
score from the mean. Wechsler, for the reasons just given, prefers 
to use the probable error. But there is a fixed relationship to ordi- 
nary standard scoring, because the P.E. is always .6745 times the 
S.D. Once the I.Q. values of 90 and 110 have been determined, it 
is an easy matter to work out all other values and their equivalents 
in raw scores. Wechsler presents tables in his book from which 
the I.Q. equivalent of any raw score can be read off. 
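
The conversion just described can be sketched in a few lines of Python. This is a minimal illustration and not Wechsler's published tables: the function name `deviation_iq`, the assumption that the scale is linear with the age-group mean falling at 100, and the sample mean and standard deviation are all introduced here for the example.

```python
from statistics import NormalDist

PE_FACTOR = 0.6745  # probable error = .6745 times the standard deviation


def deviation_iq(raw_score, group_mean, group_sd):
    """Hypothetical helper: convert a raw score to a Wechsler-style I.Q.

    Assumes a linear scale on which the age-group mean maps to 100 and
    each probable error above or below the mean is worth 10 I.Q. points.
    """
    probable_error = PE_FACTOR * group_sd
    return 100 + 10 * (raw_score - group_mean) / probable_error


# Invented age-group statistics for illustration only (not Wechsler's norms).
mean, sd = 50.0, 12.0
pe = PE_FACTOR * sd

print(deviation_iq(mean - pe, mean, sd))  # score 1 P.E. below the mean
print(deviation_iq(mean + pe, mean, sd))  # score 1 P.E. above the mean

# Share of a normal distribution lying within 1 P.E. of the mean.
share = NormalDist().cdf(PE_FACTOR) - NormalDist().cdf(-PE_FACTOR)
print(round(share, 3))

# Standard deviation of I.Q.'s implied by "10 points per probable error".
print(round(10 / PE_FACTOR, 2))
```

Under these assumptions the two scores come out at 90 and 110, about half of a normal distribution lies between them, and the implied standard deviation of the I.Q.'s is a little under 15 points.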


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2. Critique and Appraisal 


A. The scoring units. Wechsler’s purpose in his distinctive
redefinition of the intelligence quotient is abundantly clear. It is the
wish to have a numerical score that will always mean the same
thing at all age levels. An I.Q. of 90 always means a score which
is 1 P.E. below the mean, and this significance is constant. If a
person makes an I.Q. score of 90 when he is 10, and the same score
when he is 20, and the same again when he is 40, one need not
ask whether or not some change in his relative standing has taken
place which is obscured by the seemingly identical rating. This, as
we have seen, is not quite the case with Stanford-Binet I.Q.’s. An
I.Q. of 150 on Form L of the Revised Stanford-Binet scale does
not indicate the same deviation from the mean at the age of 6
as it does at the age of 12. The reader should carefully compare
the data in Table 10 with the material in Table 12 and in Figure 6.
He will note the much greater stability of the standard deviations
of the Wechsler-Bellevue I.Q.’s, age for age, compared to that of
the Stanford-Binet I.Q.’s. It can hardly be denied that this
indicates a considerable superiority, at least in this respect, for
Wechsler’s scheme of scoring. A given I.Q. obtained by the use of
his scale, and derived from the obtained raw score, always has
the same significance at any age level, in the strict statistical sense
of representing always the same deviation from the mean. In
less technical language this means that a given I.Q. will always
be equally “hard” or equally “easy” to get, or that it will place an
individual in the same position relative to other persons in his age
category.

The use of the term I.Q. for this score is, however, open to
considerable objection. It was adopted by Wechsler for
prudential reasons, and because the term is familiar. But
it means something decidedly different from the usual sense of
the expression, both in popular use and in test construction. And
it introduces an element of confusion into the terminology of
the subject.

B. Reliability and validity. The reliability coefficients reported
by Wechsler for the whole scale are in the order of .90.

As to validity, the general conception which the test is designed
to embody is made quite explicit. Intelligence, as Wechsler
understands it, is the aggregate or global capacity of the individual to act
purposefully, to think rationally, and to deal effectively with his
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environment. Wechsler recognizes the importance of nonintel- 
lectual factors in this inclusive capacity. But he argues that if a 
test reveals enough of what general intelligence is to enable one 
to predict global capacity with reasonable confidence, it is satis- 
factory. 


TABLE 12

MEANS AND STANDARD DEVIATIONS OF I.Q.’S FOR AGE GROUPS ON
WECHSLER-BELLEVUE INTELLIGENCE SCALE

(Quoted from Wechsler, 1944, Table 19, p. 122)

          No.        Mean      Standard
Age       of Cases   I.Q.’s    Deviations
10           60      101.25      13.20
11           60      100.84      14.10
12           60      100.08      13.80
13           70      100.57      14.70
14           70       99.93      14.75
15          100      100.00      14.57
16          100      100.30      15.15
17-19       100       98.75      14.50
20-24       160      100.16      13.70
25-29       195      100.89      14.60
30-34       140       99.67      15.60
35-39       135       99.75      15.50
40-44        91      100.30      14.80
45-49        70      100.07      14.01
50-54        55      100.50      13.97
55-59        50       99.1       16.85
60-69       105       99.84      15.26


The internal validity and self-consistency of the instrument has 
been ascertained, and the degree of it is reported in the data
already presented, particularly in regard to the selection of the 
subtests and their correlation with the scale as a whole. 

As to external validation, Wechsler regards correlation with
similar tests as a minimum requirement in all cases. The obtained
correlations in the present instance are shown in Table 13.
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TABLE 13 


CORRELATIONS OF OTHER TESTS WITH WECHSLER-BELLEVUE
INTELLIGENCE SCALE

(Adapted from Wechsler, 1944, Table 28, p. 134)

                                                     Number      Correlations with
Name of Test                                         of Cases    Wechsler-Bellevue
Stanford Revision of Binet Scale (1916)                  75            .82
Stanford Revision of Binet Scale (1916)                  61            .81
Revised Stanford, Form L (1937)                          55            .91
Revised Stanford, Form L (1937)                         112            .62
Stanford Revision of Binet Scale (1916)                 125            .57
Revised Stanford (1937)                                  60            .93
Army Group Examination Alpha                             92            .74
American Council on Education Psychological
  Examination for College Freshmen                      112            .53
Morgan Test of Mental Ability                       [illegible]        .62
Henmon-Nelson Test of Mental Ability                     50            .81
I.E.R. Intelligence Scale, CAVD                         108            .69
I.E.R. Intelligence Scale, CAVD                          60            .39
Otis Self-Administering Tests of Mental Ability     [illegible]    [illegible]
Otis Self-Administering Tests of Mental Ability     [illegible]    [illegible]

Balinsky and Wechsler (q.v.) have undertaken to compare the
clinical validity of the Wechsler-Bellevue scale with that of the
Stanford-Binet. They applied both tests to two groups of retarded
or disordered patients, and adopted as their validation criterion
the record of the later commitments of these patients to an
institution for mental defectives, which was decided on the basis
of a study of all the facts from case histories and from
psychiatric interviews and observations. The data obtained are
summarized in Table 14. It will be seen that the correlations of the
Wechsler-Bellevue scale are consistently higher, and in two
instances much higher, than those yielded by the Revised
Stanford-Binet scale. Altus (q.v.) has published subtest
correlations from .334 to .585 for Army trainees against a criterion of
success in training, using Form B. Considering the shortness of the
Form B subtests and the narrow range of ability of the group, 
these indicate a significant degree of relationship. 
TABLE 14 
CORRELATIONS OF TWO SCALES WITH COMMITMENT TO INSTITUTION FOR
FEEBLEMINDED
(After Balinsky and Wechsler)

                      CORRELATIONS OF I.Q.’S WITH INSTITUTIONAL
NUMBER OF CASES                      COMMITMENT
IN EACH GROUP        Wechsler-Bellevue        Revised Stanford
      49                   .753                    .664
      36                   .720                 [illegible]
      81                   .791                    .325
      63                   .785                    .274


C. General points. The Wechsler-Bellevue scale is intrinsically 
superior for adults in comparison with the Revised Stanford-Binet 
scale. It does not involve the difficulty encountered by the latter 
in connection with ages above 16. Its units of measurement are 
more realistic and exact, in spite of the rather unfortunate use of 
the term intelligence quotient. The subtests are better suited for 
older persons than the upper-level subtests of the Stanford-Binet. 
Besides this, it is a more flexible instrument, though it might be 
made still more so if norms were available for the separate sub- 
tests so that they could be used independently at the discretion 
of the examiner. 


2. Detroit Tests of Learning Aptitude * 

Another instrument comparable to the Wechsler-Bellevue scale 
in general make-up, working principles, and relationship to the 
ideas of Binet is the Detroit Tests of Learning Aptitude. It con- 
sists of 19 subtests, 13 of them being used for ages 3 to 6, 15 for 
ages 9 to 12, and 13 for ages 14 and over. Representative samples 
are shown in Figure 11. A separate M.A. is obtainable from each
subtest, which is an innovation in line with advanced notions of 
test construction. However, in comparison to the Stanford-Binet, 

* Reference: Baker and Leland. 


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the Kuhlmann tests, and the Wechsler-Bellevue, its standardiza- 
tion is inadequate, and the analytic interpretations presented in 
the manual by the authors are unconvincing. It is supposed to 
reveal reasoning and comprehension, practical judgment, verbal 
ability, number ability, auditory attention ability, visual attention 
ability, and motor ability. The authors also claim that these abil- 
ities are embodied in various school subjects. But the whole inter- 
pretation lacks statistical foundation, and while basic mental traits 
may perhaps be revealed by factor analysis, they certainly cannot 
be by direct inspection, for then they appear uncomfortably like 
faculties. Moreover, the title, mentioning “learning aptitude,” is
attractive. But one must remember that the specific relationship 
between mental test performance and learning performance is 
slender and doubtful. All in all, criticism on the grounds of being 
a “hodgepodge” seems far juster here than when directed against 


the Stanford-Binet scale. 


COMPLETE DEPARTURE FROM BINET’S PRACTICE 


1. I.E.R. Intelligence Scale CAVD (Thorndike and Others) *


This scale marks a very definite departure from the ideas and 
practices of Binet. It is for individual administration at the lower
age levels, and can be used as a group test at upper levels. It is 
applicable from the age of 3 years to upper adult levels. 

It embodies very clear and explicit conceptions. Thorndike, as 
already pointed out, regards intelligence as manifesting the four 
attributes of altitude as determined by the difficulty of the tasks 
that can be done, range as meaning the spread of different tasks 
at each level of difficulty, area as meaning the total number of 
possible tasks at all levels of difficulty, and speed. Altitude is the
most essential characteristic of intelligence, and range is said to
correlate with it perfectly, at least in theory, the relationship being
lowered in individual cases by the effects of training, experience,
and opportunity. Thus the scale is set up primarily to measure 
altitude of intellect. 

It consists of four types of items: sentence completion,
arithmetical problems, vocabulary, and directions. In deciding upon
these types of items, Thorndike was influenced by the following
considerations.



* References: Thorndike and Others, 1927; Thorndike, Woodyard, and Lorge 
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1. Pictorial Absurdities 

Pictures with something foolish in each, to be identified.

2. Verbal Absurdities
A series of absurd statements, absurdity to be identified.

5. Motor Speed and Precision
Putting crosses in numerous circles.

6. Auditory Attention Span for Unrelated Words
Two sets of unrelated words in items of increasing length, to
be given orally and repeated.

7. Oral Commissions
Giving the subject instructions to do various things.

8. Social Adjustment A
Questions about social situations, e.g., what to do if one’s
radio disturbs somebody.

10. Orientation
42 miscellaneous questions.

11. Free Association
A list of stimulus words given orally to which the subject
makes a free-association response.

12. Memory for Designs
Copying geometrical figures from memory.

19. Likenesses and Differences
Pairs of things named, some alike, some different, subject to
tell which and in what way.


FIG. 11. DETROIT TESTS OF LEARNING APTITUDE.
PARTIAL SYNOPSIS


First were considerations having to do with psychological
theory. (a) A response to parts or aspects of a situation is more
“intellectual” than one to gross totals. (b) A response to parts or
aspects of a situation not presented to the senses is more
intellectual than one to those that are presented to the senses. (c) A
response to the relationships between objects is more intellectual
than a response to the objects themselves. (d) A response to
subjective relationships, such as likeness and difference, is more
intellectual than a response to objective relationships, such as those
of space and time. (e) The organization of many mental connec- 
tions, i.e., “thinking things together,” is more intellectual than 
using one habit at a time. (f) Response to novel situations is more 
intellectual than response to familiar situations. 

Second are considerations related to the theory of measurement. 
(a) Tasks representing a single ability should be capable of very 
fine gradations from easy to hard. (b) Such tasks should be 
capable of very wide extension by alternatives at any level of 
difficulty. (c) So far as possible any one single ability should 
represent something varying only in amount. 

Third are considerations related to common sense. (a) The tasks 
should have high correlations with reasonable criteria of intellect. 
(b) They should be convenient for use. (c) They should be tasks 
for which subjects for experimentation are available. 

Thus Thorndike arrives at his notion of “intellect CAVD.” His 
explanation of the four designations is as follows. “C. To supply
words so as to make a statement true and sensible. A. To solve
arithmetical problems. V. To understand simple words. D. To
understand connected discourse as in oral directions or paragraph
reading” (Thorndike and Others, 1927, p. 65).

The four extensive subtests of the scale are subdivided into 17
levels, designated from A to Q. On level A the expectation is that
a child with a mental age of 3 will get 50% of the items right. On
level [...] attained by less than 10% of college
students. Levels A to E are for preschool [...], F to H for elementary
school, G to K for junior high school, I to M for senior high school,
and N to Q for college and graduate levels. There are in all 40
items at each level, making 10 for each subtest.

A striking feature of the scale is that it purports to measure
increments of intelligence from a true zero point and in equal
[...]s. The steps of equal difficulty are obtained by determining the
level of difficulty of test performances by the percentages of the
standardization group passing at each point, and by converting
these percentages into scores based on the standard deviation of
[...]. The scale yields three kinds of scores, namely altitude scores
which mean the [...] percent of items passed, range scores which mean
the percent of items passed at each level, and area scores which
mean the [...] of all successes. In its original form the scale is
difficult [...] and to administer. A more practicable form has
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been developed by Thorndike and Woodyard (q.v.). The obtained 
reliabilities and intercorrelations of subtests are high. 
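
The conversion of percentages passing into units based on the standard deviation can be illustrated with a short Python sketch. It shows only the general technique of reading a percentage as a point on the normal curve; it is not Thorndike's actual procedure, which also involves locating a true zero point. The function name `difficulty_in_sd_units` and the assumption that ability is normally distributed in the group are introduced here for illustration.

```python
from statistics import NormalDist


def difficulty_in_sd_units(percent_passing):
    """Hypothetical helper: express a task's difficulty as a normal deviate.

    Assumes ability is normally distributed in the group tested, so that a
    task passed by p percent of the group lies at the point on the ability
    scale which cuts off the top p percent of the distribution.
    """
    if not 0 < percent_passing < 100:
        raise ValueError("percent_passing must lie strictly between 0 and 100")
    return NormalDist().inv_cdf(1 - percent_passing / 100)


# Tasks passed by fewer persons receive higher difficulty values.
for pct in (90, 75, 50, 25, 10):
    print(pct, round(difficulty_in_sd_units(pct), 2))
```

On this assumption a task passed by half the group falls at the group mean, and equal differences in the resulting values stand for equal distances along the ability scale, which equal differences in percent passing do not.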

The distinctive feature of the scale as a psychometric instrument
lies in Thorndike’s claim that it represents a true unitary sampling 
of intellect very sharply defined. In this it is in sharp contrast with 
Binet's own scale and the instruments more or less patterned on 
it, which are loose composites and defended as such. Also, it differs 
sharply from the Wechsler-Bellevue scale or the Detroit Tests of 
Learning Aptitude, which are also composite in character. 


SUGGESTED ADDITIONAL READINGS 


The reader who wishes to make a thoroughgoing study of any of 
the scales here discussed should turn to the references on each that 
are given in the text. However, the following suggestions are made: 

Rudolph Pintner, Intelligence testing: methods and results (Rev.
ed.; New York: Henry Holt and Company, 1931), Chapter 2, “The 
work of Binet.” 

Quin McNemar, The revision of the Stanford-Binet Scale. An 
analysis of the standardization data (Boston: Houghton Mifflin 
Company, 1942), Chapter 1, “The revision procedures.” 

Lewis M. Terman and Others, The Stanford Revision and Exten- 
sion of the Binet-Simon Scale for Measuring Intelligence (Baltimore: 
Warwick and York, Inc., 1917). This is a detailed technical account 
of the first Stanford Revision. 

Lewis M. Terman, The measurement of intelligence (Boston: 
Houghton Mifflin Company, 1916). This is a less technical account 
of the first Stanford Revision.

Lewis M. Terman and Maud A. Merrill, Measuring intelligence 
(Boston: Houghton Mifflin Company, 1937). This is an account, not 
highly technical, of the second Stanford Revision. 

F. Kuhlmann, Tests of mental development (Minneapolis: Edu- 
cational Test Bureau, 1939). Contains a full account of the tests, 
their principles, scoring, etc. 

David Wechsler, The measurement of adult intelligence (Baltimore:
The Williams and Wilkins Company, 1944), Chapter 10,
“Limitations and special merits.”

QUESTIONS FOR DISCUSSION 


1. Assemble from the chapter all the various methods used for
constructing intelligence scales with a view to making them valid, and
of demonstrating their validity. Consider how much they seem to
prove, and what they prove.

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2. Which of the two meanings of mental age seems to you more 
reasonable? Which is more commonly used? Read Otis 1916 and 
Thurstone 1926 on the matter. 

3. From the data in Table 10 and Figure 6 show how any given
I.Q. would shift at different age levels if the subject remained the
same standard deviation distance from the mean at all ages. Would
similar shifts occur with the Percent of Average (Table 11)? With 
the Wechsler I.Q.’s (Table 12)? 

4. Compare the Stanford-Binet scale, the Tests of Mental Develop- 
ment, and the Wechsler-Bellevue scale to see how far the arguments 
in favor of point scales summarized in Figure 8 are borne out in 
connection with them.

5. What are the arguments for and against a “composite” or very 
diversified scale for the measurement of intelligence? Discuss this in 
connection with the specific scales here considered. 

6. In what respects do Terman and Kuhlmann seem to agree as 
to the course of mental growth? What difference do you find between 
them? 

7. List specifically and discuss the progressive steps away from the 
practices and ideas of Binet represented in the scales here considered. 

8. Why might a composite test be expected to have more clinical 
value than a highly unified one? Would this make the former a better 
test of intelligence?

9. Why does an age of arrest occur in the scoring of the
Stanford-Binet scale but not in the scoring of the Wechsler-Bellevue scale or
the I.E.R. Intelligence Scale CAVD?

10. Vocabulary subtests are used in several of the scales here
discussed. Which ones? Does this suggest any general attitude on the
part of psychometric workers to criticisms of such tests?


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CHAPTER V 
TESTS OF INTELLIGENCE (I) 


Group intelligence testing has an origin almost as definite as
that which led to the type of instruments just discussed. It
emerged as a major movement out of the work done in the United
States Army in World War I. It was not at that time a wholly
new undertaking, but it received a major impulsion from the Army
work and its success.

The evolution of group intelligence testing, however, has been
much less clear-cut than that of individual intelligence testing.
Many of the early group tests are still widely used, usually in
revisions and sometimes with new names, with refinements and
improvements, but without basic alteration. Moreover, the
developments have been much more diverse. The present chapter
and the following one are devoted to this subject. The topics
covered are: Army testing during World War I, tests used
primarily for wide age ranges, tests for high school and college level,
performance tests, tests for young children, tests for adults, new
and emerging types of tests. Under each of these headings an
attempt will be made to show the development that has taken
place, and the whole discussion will end with a summary and
appraisal of significant trends.


ORIGINS: ARMY TESTING IN WORLD WAR I


As has been said, the first major development of group intelli- 
gence testing was the work in the United States Army during 
World War I. The original tests that were constructed have be- 
come outmoded, although some of their revisions are still used to 
some extent. But many of the persistent problems of group meas- 
urement defined themselves then, and many of its characteristic 
concepts and methods were established. Thus some understanding 
of the original Army testing program is valuable as leading to a
comprehension of the development since that time. 


1. The tests and their chief revisions * 


Two major tests were developed in connection with the Army 


* References: Yerkes, 1921; Yoakum and Yerkes (for the tests); Guilford 
(1938) ; Wells (1932) (for Modified Alpha) ; Kellogg and Morton (for revised 
Beta). 
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Program. One was a verbal intelligence test, known as Army Group 
Intelligence Examination Alpha. The other was a performance 
test, known as Army Group Examination Beta. 

A. Army Group Intelligence Examination Alpha. This test was
arrived at by a fairly prolonged experimental process, and is itself
a revision of an earlier tentative instrument known as “A.” It
consists of 8 subtests. (1) Following verbal directions, intended
to reveal the span of auditory attention, with a time allotment of
4 minutes. (2) 20 arithmetic problems, time 5 minutes. (3) 16
items involving “common sense” or “practical judgment” of what
to do in described problematic situations, time 1½ minutes. (4)
40 pairs of words to be marked as meaning the same or different,
time 1½ minutes. (5) 24 disarranged sentences to be understood
and marked true or false. (6) 20 incomplete number series to be
completed, time 3 minutes. (7) 40 verbal analogies, time 3
minutes. (8) 40 multiple-choice items calling for miscellaneous
information.

The test is scored on the number of right responses, except with 
Subtests 4 and 5, in which a deduction for error is made to com- 
Pensate for chance, The test was originally in five forms of equiva- 
lent difficulty, the same pattern of subtests appearing in each, but 
With different items. = 
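
The deduction for error in Subtests 4 and 5 can be illustrated with a
short sketch. The text does not state the exact deduction used, so the
code below assumes the classical correction for guessing, in which the
score is rights minus wrongs divided by one less than the number of
choices. For two-choice items such as same-or-different and true-or-
false, this reduces to rights minus wrongs. The function name and the
sample figures are hypothetical.

```python
def corrected_score(right: int, wrong: int, n_choices: int) -> float:
    """Classical correction for guessing: R - W / (k - 1).

    Assumption: this is the standard textbook formula, not a formula
    quoted from the Army manuals. Omitted items are not penalized.
    """
    if n_choices < 2:
        raise ValueError("an item needs at least two choices")
    return right - wrong / (n_choices - 1)


# Hypothetical two-choice subtest: 30 right, 8 wrong, 2 omitted.
two_choice = corrected_score(30, 8, 2)
# The same record on a five-choice subtest, shown only for contrast.
five_choice = corrected_score(30, 8, 5)
print(two_choice, five_choice)
```

With two choices a blind guess is right half the time, so each wrong
answer is taken to cancel one lucky right answer. With more choices the
deduction per error is smaller.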

B. Revisions of Army Alpha. The First Nebraska Revision
follows the original make-up and scoring plan. The revision is
based on an analysis of the items. The items found by correlation
with the test as a whole and by their power of differentiation to
be the most diagnostic were retained. Also, numerous items in the
original test which implied special Army knowledge and experience
were eliminated. The original norms were retained. Intelligence
quotient norms were worked out on the basis of the distribution
of intelligence quotients as given by Terman and Merrill (1937 b)
for the American white population. No data on validity or relia-
bility are given in the manual.

In the Schrammel-Brannan Revision, which is intended for
grades 4 to 16, the original five forms are reduced to three. The
subtests are retained. Oral directions and instructions to the
subjects, which played an important part in the original instru-
ment, are reduced, and the test is made largely self-administer-
ing. Also, [...] introduced to facilitate scoring. Norms were
worked out for populations of about 7[...] for each grade from 4 to
[...], together with about 200 college freshmen and about 300 sub-
jects for each of the following three college years. Age norms were
extended from 8 to 25, but are best from 9 to 17. A progressive
increase of scores with age up to 17 was found, after which the
relationship of score to age became irregular.

Another important revision, made by F. L. Wells and published
in 1939, is known as Modified Alpha Examination (v. Wells, 1932).
The practical judgment subtest is eliminated, and replaced by a
numerical subtest. Antiquated items are revised. The subtests are
arranged so that separate numerical and verbal scores can be
obtained. Percentile norms are supplied for high school boys, for
high school girls, for seventh and eighth graders, and for adult
men applying for executive positions. The reported self-correlation
for total score is .92 with a standard error of 6.71. Total score
correlations with the Otis Self-Administering Test of Mental
Ability are .73 for high school freshmen, .64 for high school sopho-
mores, .72 for high school juniors, and .79 for high school seniors.

Another type of revision has turned in part upon the elimina-
tion of the separate subtests, and the organization of the whole
instrument into a single “scrambled” or “spiral omnibus” form.
The chief argument for this is administrative convenience, for
whatever instructions are needed can be given all at one time at
the start of the testing instead of breaking off at the end of each
subtest. The subject responds to a series of items drawn in irregu-
lar order from the various subtests.

C. Army Group Examination Beta. This is a paper-and-
pencil test, in general of “performance” type, intended for those
unable to read English. It was arranged so that the examiner
could instruct the subjects by pantomime and with a minimum
of words, and by means of blackboard demonstrations on how to
work. It consisted of 7 subtests. (1) 5 mazes, time 2 minutes,
emphasizing speed. (2) 16 pictures of piles of cubes, the task being
to give the number in each pile, time 2½ minutes. (3) Nonverbal
completions, consisting of patterns of the letters X and O to be
completed in series as begun, time 1¾ minutes. (4) Association
of symbols with numbers according to a given code, time 2 min-
utes. (5) A series of pairs of numbers to be compared, the task
being to mark those in which the numbers were different, time 3
minutes. (6) Drawing the missing parts of a picture, time 3 min-
utes. (7) 10 paper form-board problems, time 2½ minutes.*

The scoring is on the number of right responses, except with
the code substitution subtest.

* To understand the general character of the paper form board see Figure 12.
This is a good concrete illustration of the problem of translating performance
items into a form manageable in group testing.

D. Revision of Army Beta. The most important revision of
Beta is that made by Kellogg and Morton. Certain not very
important changes were made in the subtests. (a) The maze test
was retained from the original Beta. (b) Tests 2 and 3 from the
original Beta were eliminated (cube analysis and X-O). As sub-
stitutes were introduced a picture discrimination test in which
the task was to find the wrong item in a picture, and a modifica-
tion of the digit-symbol test, according to which the subject writes
numbers instead of symbols as he previously did. (c) Test 5 in
the original Beta (number comparison) was extended to include
pairs of pictures as well as pairs of numbers, and also pairs of
symbols. (d) Tests 6 and 7 in the original (picture completion
and form boards or geometrical constructions) were retained in
their previous form, except that they were lengthened.

The test was standardized on Canadian children but is suitable
for use in the United States. A split-half reliability of .987 and a
retest reliability of .77 are reported. The authors also give mental
age equivalents for the various scores. But they do not give mental
age distributions at the various ages, which makes the calculation
of intelligence quotients meaningless. On the whole, the revision
is better than the original test, which merely served for the rapid
measurement of illiterates.


2. Points of significance for psychometrics


In the construction, revision, and use of these tests a number
of considerations of permanent and far-reaching significance for
mental testing were involved (v. Yoakum and Yerkes; Yerkes;
throughout).

A. The conception of intelligence. A working conception of
intelligence was set up as a guide and a criterion for validation.
It consisted of (a) formal school attainment; (b) scores on the
Stanford Revision of the Binet scale; (c) ratings by officers.
Ratings on these three variables were obtained for a standardiza-
tion group of about 900. A preliminary form of Alpha was found
to correlate .75 with the first, from .80 to .90 with the second, and
from .50 to .70 with the third. Alpha was found to correlate .72
with scores on the Trabue Language Completion Scale, an early
group intelligence test. Total scores on Alpha correlated .80 with
total scores on Beta.

[Figure: Example A. Which of the five figures A, B, C, D, E reproduces
the two forms in the example placed together?]

FIG. 12. SAMPLE ITEM FROM THE REVISED MINNESOTA PAPER FORM BOARD

B. Criteria for selecting subtests. Two criteria were kept in
view in the selection of subtests. These have always been re-
garded as of importance since that time. (a) It was desired that
the separate subtests should have the highest possible correlations
with total scores on the test. In the case of Alpha, the lowest
correlations were found in the case of verbal directions, practi-
cal judgment, and disarranged sentences, the coefficients running
around .65. The highest, running around .85, were found for
arithmetic problems, verbal opposites, verbal analogies, and infor-
mation. (b) It was considered desirable that the subtests, while
correlating as high as possible with the test as a whole, should
correlate as low as possible with one another. This is a com-
monly accepted statistical norm, for which there is an obvious
reason. Closely related subtests clearly involve much in common.
Thus each one will give less new information about the subject
than in the case of subtests which are not closely related and have
little in common. Put in statistical terms, if the subtests have high
intercorrelations, each adds little to the validity of the total final
score. Thus low intercorrelations are statistically desirable. But
often they are psychologically impossible, for the reason that the
subtests are not built to measure distinct and different mental
factors, and overlap greatly in their factorial content. This was
found to be so in the present case, for the mean of intercorrela-
tions of the subtests of Alpha was about .61, and those with the
closest relationship to total scores had the closest relationship to
one another. So in the Army test construction this second criterion
had to be abandoned, which has often happened in such work
since then.
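
The statistical point, that highly intercorrelated subtests add little
to the validity of the total score, can be made concrete with a small
sketch. It assumes equally weighted standardized subtests, each
correlating equally with an outside criterion, and uses the standard
formula for the correlation of a sum with a criterion. The criterion
correlation of .40 and the contrasting intercorrelation of .20 are
hypothetical values chosen for illustration. Only the figure of .61
comes from the text.

```python
import math


def composite_validity(k: int, r_criterion: float, r_inter: float) -> float:
    """Correlation of an equally weighted sum of k standardized
    subtests with a criterion.

    r_criterion: correlation of each subtest with the criterion.
    r_inter: mean intercorrelation among the subtests.
    Formula: k * r_c / sqrt(k + k * (k - 1) * r_ii).
    """
    return k * r_criterion / math.sqrt(k + k * (k - 1) * r_inter)


# Eight subtests, as in Alpha. The .40 criterion correlation is hypothetical.
low_overlap = composite_validity(8, 0.40, 0.20)
high_overlap = composite_validity(8, 0.40, 0.61)
print(round(low_overlap, 3), round(high_overlap, 3))
```

Under these assumptions the eight-subtest total is considerably more
valid when the subtests overlap little than when their mean
intercorrelation is as high as the .61 reported for Alpha.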
C. Specialized items. Army Group Intelligence Examination
Alpha, in original form, contains many items calling for military
experience and information. This has been a point of criticism,
and it has been altered in the later revisions of the test for general
use. However, as F. N. Freeman (1939) points out, the acquisition
of ideas and information in a uniform environment is a legitimate
measure of intelligence. This consideration has assumed an increas-
ing importance in recent years, for there has been a tendency to
design intelligence tests with reference to the background and
interests of special groups, such as medical students, naval per-
sonnel, industry, and the like.

D. Speed and power. An issue of considerable general importance
which was raised conspicuously in the Army work was that of the
significance of speed and power in intelligence testing. Alpha, as
will be seen from the description given above, was a closely timed
test. What would happen to the showing of groups of subjects if
the time limits were extended or even removed? Apparently noth-
ing very striking, for when double time was allowed, the scores so
obtained correlated .965 with those obtained under the ordinary
time limits. When unlimited time was allowed, the correlation
with regular timing was .945. Similarly Ruch (1924), using the
Terman Group Test of Mental Ability with 88 children in the
eighth grade, had them work with black pencils up to the time
limit, and then continue with red pencils, thus making it possible
to see what was done in the added time. Apparently it made no
considerable difference to the distribution of the scores, for the
correlation between scores on regular and added time was .9[...].
There has been considerable debate as to the meaning of these
results (Yoakum and Yerkes; F. S. Freeman, 1928; F. N. Free-
man, 1939; Ruch, 1924; Ruch and Koerth; Brigham; Stoddard,
1943). On the one hand, it is argued that correlations between
standard and extended time would be low only if the test were
a power test. On the other hand, it is contended that the obtained
high correlations indicate that the differential factor of speed
makes no particular difference. It must be remembered, however,
that high correlations do not mean that all the scores in a distribu-
tion remain the same. It only means that they retain the same
relative position; and to fulfill this condition, they might all in-
crease or decrease regularly or they might remain unchanged. The
truth probably is that Alpha is a power test for those of low intelli-
gence, who could not raise their scores very much no matter how
much time they had, and a speed test for the able, who make such
good scores even under a time limit that there is not much room
for improvement.
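
The caution about interpreting high correlations can be checked
numerically. The sketch below uses invented scores, not Army data. It
shows that adding the same gain to every score leaves the correlation
at unity, and that even uneven gains, small at the extremes and larger
in the middle as the paragraph above suggests, still leave the
correlation very high while the mean score rises appreciably.

```python
import math


def pearson(x, y):
    """Plain Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores under the standard time limit.
standard = [20, 35, 50, 80, 110, 140]

# Case 1: every subject gains the same 15 points with extra time.
uniform = [s + 15 for s in standard]

# Case 2: small gains at the extremes, larger gains in the middle.
gains = [2, 5, 12, 20, 15, 6]
uneven = [s + g for s, g in zip(standard, gains)]

r_uniform = pearson(standard, uniform)
r_uneven = pearson(standard, uneven)
mean_gain = sum(gains) / len(gains)
print(r_uniform, r_uneven, mean_gain)
```

In both cases the relative positions of the subjects are nearly or
wholly preserved, which is all that the coefficient reflects.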

As to the issue in general, in some tests, such as those for type-
writing or stenography, speed may be the main consideration,
although even there it is associated with accuracy. In intelligence
tests, however, power is clearly the important factor. In any situa-
tion involving intelligence, speed may be an indication if it is
combined with difficulty, for the power to do a difficult task
quickly is clearly significant. Thus a pure speed test which is
simply a series of homogeneous low-level tasks cannot be a good
test of intelligence. And as Thorndike (Thorndike and others,
1927) remarks, the time taken to do a job can only be significant
if the job is performed substantially correctly. Even then there is
a question, for one would not say that if Einstein had taken twice
as long as he did to work out the basic formula for the theory of
relativity, this proved that he had much less intelligence than we
now suppose.

E. Mental age equivalents. By all means, the most popularly
exciting issue in connection with the Army work turned on the
mental age equivalent of scores on Alpha. Also, this issue involves
much of far-reaching importance for psychometric theory and
practice.

Mental age equivalents of various levels of Alpha performance
were found by comparing the Alpha scores with the Stanford-Binet
mental ages of the standardization group to whom the latter test
had been given. Following the letter rating shown in Table 7,
the mental age equivalents were these: D− (scores 0-14) corre-
sponded to an M.A. range of 0 to 9-4. D (scores 15-24) corre-
sponded to an M.A. range of 9-5 to 10-9. C− (scores 25-44)
corresponded to an M.A. range of 11-0 to 12-9. C (scores 45-74)
corresponded to an M.A. range of 13-0 to 14-9. C+ (scores 75-
104) corresponded to an M.A. range of 15 to 16-4. B (scores 105-
134) corresponded to an M.A. range of 16-5 to 17-9. A (scores
135-212) corresponded to an M.A. range of 18 and upwards. Now,
there is a formidable implication contained in these data. Since
[...] about 13 years, and [...]

When these implications were pointed out with some emphasis,
and their logical consequences bearing on mass intelligence, mass
entertainment and the support of democratic institutions were
elaborated, they came as a violent and sensational shock. Of more
importance for students of psychometrics, they were emphatically
at variance with the results of Terman, who had published data
showing that the distribution of intelligence quotients at various
age levels is approximately normal (1916). Two types of explana-
tions have been offered. (a) One is to the effect that [...]
the Stanford-Binet [...] made for and given to children in school,
[...] given to adults anywhere [...]
in the items (F. N. Freeman, 1939). (b) The other is to the effect
that intelligence is actually dulled and lowered by disuse and
facilitated and increased by use (Stoddard, 1943). This, of course,
involves the rather singular implication that as soon as a person
leaves school his intelligence ceases to be stimulated, and also it
raises the broad issue of the effect of environment upon intelli-
gence.

F. Success of the work. Finally, it should be noted that the
Army project was a definitely successful venture in group intelli-
gence testing on a very large scale. Certain important purposes
were substantially fulfilled. (a) An intelligence rating was assigned
to every soldier. (b) A designation was made of those whose intel-
ligence indicated the desirability of advancement or special assign-
ment. (c) There was prompt selection and recommendation for
development battalions of those of inferior intelligence unsuited
for regular military training. (d) The provision and recording of
mental measurements assisted officers in building organizations of
uniform mental strength, or in conformity with definite require-
ments. (e) An effective means was provided to aid in the selection
of men for certain types of duty or assignment, such as to military
training schools, colleges, and technical schools. (f) Data were
provided for the formation of special training groups within the
regiment. (g) Means were provided for the early identification
and elimination of those unfit for military service because of low
intelligence.

In a sample six-month period .5% of inductees were reported
for discharge because of mental inferiority, .6% were recom-
mended for labor battalions because of low intelligence, .6% were
recommended for assignment to development battalions for ob-
servation and preliminary training. Commissioned officers were
found located chiefly in the A and B categories of intelligence
scores. Those below C− were expected rarely to succeed in officers'
training courses. Noncommissioned officers were chiefly in the C+
category or above. Tests were admittedly no substitute for esti-
mates on other criteria, such as character, loyalty, bravery, power
to command, and so forth. But intelligence as revealed by test
was regarded as the most important single factor in military
efficiency. It has been remarked previously in these pages that the
ultimate validation of any test is extended practical use. This is
as good an example as any of how such a criterion works out.

Since the work done in the United States Army in the First
World War, group testing has been much influenced by its patterns
and practices but has also developed certain new trends which did
not appear at that time. We turn now to deal with this further
development.


GROUP TESTS FOR CONSIDERABLE AGE RANGES


A large number of tests, or rather test batteries, have been
constructed which are applicable to wide age ranges. Some of the
most important are the following.


1. Haggerty Intelligence Examination, Delta 1 and Delta 2 


This test, which was published in 1920, is a good early example
of the type of instrument here under consideration. Delta 1 is for
grades 1 to 3. Delta 2 is for grades 3 to 9. This instrument is a
good example of competent direct adaptation of Army practice,
and it has had a wide use.

Delta 1 consists of six subtests. (1) Oral directions to be fol-
lowed. (2) Copying designs. (3) Picture completion. (4) Picture
comparison. (5) Digit-symbol. (6) Word comparisons. The first
four subtests are nonverbal, and have the general character of
performance tests. The last two presuppose reading. In each sub-
test preliminary exercises for practice are given, with the intention
of orienting the pupil to the test and of equalizing preliminary
experience as much as possible.

Delta 2 also consists of six subtests. (1) Discrimination between
true and false statements. (2) Arithmetic problems. (3) Picture
completion. (4) Discrimination between word pairs as meaning
the same or opposite. (5) Common-sense or practical judgment in
described situations. (6) General information. It is a very direct
adaptation of Army Alpha, the practice exercises it introduces
being the most significant departure.


2. National Intelligence Tests 
This battery, published in 1920, was constructed by many of the
same persons who developed and conducted the Army testing pro-
gram. It has had a great popularity, but is now becoming less used.
There are two scales to be used in testing at each level. These
come in two separate booklets. A synoptic outline of Scale A is
shown in Figure 13. Scale B also consists of five subtests as fol-
lows: (1) Arithmetic problems. (2) General information. (3) Log-
ical judgment. (4) Verbal analogies. (5) Similarities and differ-
ences among numbers, names, and forms.

The standardization was very thorough, involving the use of
about 4,000 cases for each age and grade level. The test yields a
total numerical score, the highest obtainable being 196. A distinc-
tive feature is the fore-exercise for each subtest, which runs to
about half its length. It is somewhat of a question whether “dead”
practice material of this length does not unduly reduce testing
time. Also, it has been suggested that such extensive practice may
in part invalidate the test as an indication of adaptiveness.

The two batteries so far described are essentially modifications
of the type of test developed in the Army work. In them there are
(a) changes in content made with school use in mind; (b) revised
norms based on standardization with school groups; (c) emphasis
upon the use of practice material. These are clearly minor modi-
fications [...] distance in the way of transformation.


3. Henmon-Nelson Test of Mental Ability * 


This test appeared originally in 1932. It was extended in 194[?],
and in 1946 new norms and interpretive material were published.
It is a good example of a group test which begins to depart from
original Army practice.

This test is for three levels, grades 3 to 8, 7 to 12, 13 to 16.

The item selection, which has been mentioned elsewhere in this
book, is very careful, and probably yielded as good a final choice
as could be made with the general type of material used. The
authors prepared 250 items and submitted them for criticism to
experienced teachers. This reduced the number to 202. These in
turn were tried out on 500 pupils, and those which correlated best
with total scores and discriminated best between those known to
be bright and known to be dull were chosen. This yielded a final
list of 180 items, which were organized into two forms. This
process of item selection was applied at each level. Two forms
were originally constructed for each of grades 3 to 8, 7 to 12, and
13 to 16. Later a third form was added for grades 3 to 8 and
7 to 12. Additional forms are being developed.
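
The two statistical criteria named above, correlation with total score
and discrimination between bright and dull pupils, can be sketched in
miniature. The response matrix, the group assignments, and the cut-off
values below are all invented for illustration, and the function names
are hypothetical. This is not the authors' actual procedure or data.
The item-total correlation here is uncorrected, that is, each item is
included in the total with which it is correlated.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def item_statistics(responses, bright, dull):
    """For each item return (index, item-total r, bright-minus-dull gap).

    responses: one row of 0/1 item scores per pupil.
    bright, dull: pupil indices for the two known groups.
    """
    totals = [sum(row) for row in responses]
    stats = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        r_total = pearson(item, totals)
        p_bright = sum(item[i] for i in bright) / len(bright)
        p_dull = sum(item[i] for i in dull) / len(dull)
        stats.append((j, r_total, p_bright - p_dull))
    return stats


def select_items(stats, min_r=0.6, min_gap=0.5):
    """Keep items meeting both (hypothetical) cut-offs."""
    return [j for j, r, gap in stats if r >= min_r and gap >= min_gap]


# Six hypothetical pupils (0-2 known bright, 3-5 known dull), four items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
stats = item_statistics(responses, bright=[0, 1, 2], dull=[3, 4, 5])
selected = select_items(stats)
print(selected)
```

An item survives only if it both agrees with the test as a whole and
separates the two known groups, which is the double requirement the
paragraph above describes.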

The reliability coefficients reported for all levels are high, run-
ning in the high .80’s and .90’s. Percentile norms for each school
grade and for the four college years are given. Also, mental age
equivalents for the obtained scores are given for elementary school
and high school groups. The norms are worked out for several
hundred thousand cases, numbers varying for various levels. The
test correlates from about .50 to over .90 with other good intelli-
gence tests. It correlates from .45 to .65 with course marks and
grade averages in college.

* Reference: Henmon and Nelson.

1. ARITHMETICAL REASONING
For example:
How many cents are six cents and five cents?
How many nickels make a dollar?
How many square inches are there in a card 7 inches long by
6 inches wide?

2. SENTENCE COMPLETION
For example:
Fish swim ........ the water.
Boys ........ girls like to ........ ball.

3. LOGICAL SELECTION
For example:
Draw a line under each of the words that tell what the thing
always has.
Table—books, cloth, dishes, top, legs.
Shoe—button, hook, sole, toe, tongue.

4. SAME-OPPOSITE
For example:
If the two words mean the same write S. If they are as dif-
ferent as they can be write D.
Yes ........ No
Son ........ Daughter
Light ........ Bright

5. SYMBOL-DIGIT
For example:
Make under each drawing the number found under that draw-
ing in the key.

FIG. 13. NATIONAL INTELLIGENCE TEST, SCALE A.
SYNOPTIC OUTLINE

The Henmon-Nelson battery has two distinctive features which
mark it as an advance on the psychometric practices established
in the Army work. (a) It is organized on the “spiral omnibus” or
“scrambled” plan. The items are of wide variety, including in each
form at each level such types as information, sentence completion,
logical selection, classification, verbal analogies, number relation-
ships, anagrams, disarranged sentences, geometrical analogies,
proverbs, word meaning, identifying family relationships, and
arithmetical problems. This range of items is not built into sepa-
rate subtests, but set up in a mixed sequence. The effect is to gain
greater ease in administration. (b) Scoring has been rendered
easier and more speedy. This is done by the use of the Clapp-
Young Self-Marking Device. The test items are printed on the
outside of two sheets which have their edges glued together to
form a sealed folder. The inside sheet has squares corresponding
to the spaces for right answers on the test sheet. The back of the
test sheet is treated with carbon ink, so that the marks made by
the subject are reproduced on the inner sheet. Thus, in order to
score, it is only necessary to count the carbon crosses that appear
inside or outside the right answer squares on the inner sheet. As
will be seen, these are departures from Army testing practice, both
in the interest of efficiency, and neither of them, so far as we know,
affecting validity or reliability.

Percentile norms are given for each grade for which the test is
intended. Grade norms are very frequently given in connection
with tests which, like the one under discussion, are among the best
and most carefully constructed that we have. However, they are
always open to some question, for the obvious reason that a
school grade is an administrative and not a psychological entity.
The fact that a child is, for instance, in the sixth grade conveys
comparatively little information about his mental or even his
educational status, so that to rate him on a grade norm may be
quite misleading. However, the test also provides for mental age
norms and derived intelligence quotients based on raw scores in
the first instance, and on raw scores and chronological ages in the
second. This undoubtedly avoids many of the objections to stand-
ardization on grade levels alone. The criticism of grade norms has
no great importance where this test is concerned, but it is a
general point, exemplified in this instance, to which the attention
of the reader should be called.


4. Otis Group Intelligence Scale * 


This battery consists of an Advanced Examination and a Pri-
mary Examination, the former being the better known and the
more widely used. The Advanced Examination is intended for
subjects of any age so long as they can read, and specifically for
those above grade 4. It has been widely used in high school, and
it is said to be not unsuited for some uses in college, although it
is too easy for this level. It contains ten subtests. (1) Following
printed directions, e.g., indicating the fourth letter of the alpha-
bet. (2) Verbal opposites. (3) Disarranged sentences to be marked
true or false. (4) Proverbs to be interpreted, with the proper in-
terpretation to be indicated from a choice of statements. (5) Arith-
metical problems. (6) Geometric figures similar to those shown in
Figure 1. (7) Verbal analogies. (8) Similarities, in general like
those shown and cited in Figure 13. (9) Narrative completion.
(10) Memory, consisting of a story read aloud with questions [...]

The primary examination, which was made later, consists of
eight nonreading group tests, and is intended for primary grades
and kindergarten.

Votaw (q.v.) has published regression lines and equations for
predicting American Council on Education Test (q.v.) scores from
Otis Group scores. The work was done on 70 subjects who were
given the Otis Test in junior high school and the American Coun-
cil Test six years later. With Otis scores on the X-axis and Ameri-
can Council scores on the Y-axis, the equations derived are as
follows: Y = 1.42X − 67.2; X = .3747Y + 74.4.
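
The use of the two regression lines can be shown in a brief sketch. It
reads the printed equations as Y = 1.42X − 67.2 and X = .3747Y + 74.4,
which is an interpretation of a poorly printed line. The function
names are hypothetical. The correlation is not quoted in the text. It
is inferred here from the standard relation that the product of the
two regression slopes equals the square of the correlation
coefficient.

```python
import math

# Slopes and intercepts as read from the text (an assumption; the
# printed line is damaged).
B_YX, A_YX = 1.42, -67.2      # American Council score predicted from Otis
B_XY, A_XY = 0.3747, 74.4     # Otis score predicted from American Council


def predict_council(otis_score: float) -> float:
    """Predicted American Council score for a given Otis Group score."""
    return B_YX * otis_score + A_YX


def predict_otis(council_score: float) -> float:
    """Predicted Otis Group score for a given American Council score."""
    return B_XY * council_score + A_XY


# Implied correlation: r squared equals the product of the two slopes.
implied_r = math.sqrt(B_YX * B_XY)

council_for_100 = predict_council(100)
print(round(council_for_100, 1), round(implied_r, 2))
```

The two lines are not inverses of one another. Predicting an American
Council score from an Otis score and then predicting back does not
return the starting value, because each line regresses toward its own
mean.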


5. Otis Self-Administering Tests of Mental Ability


This well-known and extremely popular battery, although it
was developed quite early, represents a departure from original
Army practice similar to that of Henmon and Nelson. The Self-
Administering Tests are for two levels. The Intermediate Level is
for grades 4 to 9. The Advanced Examination is for high school
and college. Both examinations come in two forms.

Each form contains 75 items on each level. Their content is
conventional. The “self-administering” feature refers to and is
based upon the “scrambled” or “spiral omnibus” arrangement, an
instance of which is shown in Figure 14. The test has high relia-
bilities, coefficients of .92 being reported for the Advanced Exami-
nation and .95 for the Intermediate Examination. It yields a
point score that can be transformed into mental ages and so-
called intelligence quotients. The method of establishing the
derived values, however, is complex and rather arbitrary. In
particular the intelligence quotients are not true quotients at all.
They are based on the deviation or difference of the subject’s
score from the mean score for the age group. If his score is above
the mean, this deviation is added to 100. If it is below the mean
the deviation is subtracted from 100. This has a certain similarity
to Wechsler’s method of calculating intelligence quotients, by
defining 1 P.E. below the mean as I.Q. 90. It is, however, less
technically justifiable. And the general objection is the same. A
term which has had an enormous popular vogue is exploited.
Scores are reported which have all the appearance of true intelli-
gence quotients. In reality, however, they are not quotients in any
sense whatsoever, but deviation scores, the true basis of which is
concealed.

* Publication dates of all tests are cited in the bibliography of tests at the
close of the book.

An electric light is to a candle as a motorcycle is to?
1. bicycle 2. automobile 3. wheels 4. speed 5. police ...... ( )

Which one of the words below would come first in the dictionary?
1. march 2. horse 3. ocean 4. paint 5. elbow 6. night 7. flown ( )

The daughter of my brother is my?
1. sister 2. niece 3. cousin 4. aunt 5. granddaughter ...... ( )

One number is wrong in the following series. Which would that
number be?
3 4 5 3 4 5 3 [...] ( )

Which of the five things given below is most like these three:
Boat, horse, train.
1. sail 2. row 3. motorcycle 4. move 5. track ( )

If Paul is taller than Herbert and Paul is shorter than Robert,
then Robert is (?) Herbert.
1. taller than 2. shorter than 3. just as tall as 4. cannot say ( )

A wire is to electricity as (?) is to gas.
1. a flame 2. a spark 3. hot 4. a pipe 5. a stove ...... ( )


FIG. 14. INSTANCE OF “SCRAMBLED” ARRANGEMENT OF TEST ITEMS

(From Otis Self-Administering Tests of Mental Ability)
TABLE 15

SCORES ON OTIS SELF-ADMINISTERING TESTS OF MENTAL ABILITY,
ADVANCED EXAMINATION, WITH STANFORD-BINET M.A.
AND OTIS I.Q. EQUIVALENTS

(Adapted from Bingham, Table 40, p. 337)

Otis Scores    Stanford-Binet M.A.’s    Otis I.Q.’s
    72                 19-3                 130
    66                 18-6                 124
    60                 17-9                 118
    54                 17-0                 112
    48                 16-3                 106
    42                 15-4                 100
    36                 14-4                  94
    30                 13-3                  88
    24                 12-2                  82
    18                 11-0                  76
    12                 10-0                  70

In order to show the misleading character of the measure there
is presented in Table 15 an array of scores on the Advanced
Examination of the Otis Self-Administering Test, together with
their equivalent Otis I.Q.’s and Stanford-Binet M.A.’s. In each
case there is a decided discrepancy. For instance, a score of 72
on the test indicates an I.Q. of 130 and a Stanford-Binet M.A. of
19-3. Taking the C.A. as 16, which is proper in connection with
most subjects for whom the Advanced Examination would be used, and
computing the Stanford-Binet I.Q. on this basis, it works out at
approximately 120. In the same way the score of 12 indicates an
Otis I.Q. of 70, but the Stanford-Binet M.A. of 10-0 yields an
I.Q. of approximately 62. This latter difference of eight points
may not seem so very great, but one must remember that it may
determine whether a subject is tentatively classed as dull-normal
or feeble-minded, so it can be more important than it looks.
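
The arithmetic behind this comparison can be sketched briefly. The
following is only an illustration, not the published Otis or
Stanford-Binet procedure. It assumes the classical ratio I.Q.
(mental age divided by chronological age, times 100), a fixed C.A.
of 16 as in the text, and an age-group mean Otis score of 42, which
is inferred from the row of Table 15 where the Otis I.Q. is 100.
The function names are hypothetical.

```python
def ma_to_months(ma: str) -> int:
    """Convert a mental age written 'years-months' (e.g. '19-3') to months."""
    years, months = ma.split("-")
    return int(years) * 12 + int(months)


def ratio_iq(ma: str, ca_years: int = 16) -> float:
    """Classical ratio I.Q.: 100 * mental age / chronological age."""
    return 100 * ma_to_months(ma) / (ca_years * 12)


def otis_deviation_iq(score: int, age_group_mean: int = 42) -> int:
    """Otis-style 'I.Q.': 100 plus or minus the deviation from the mean.

    The mean of 42 is an assumption read off Table 15
    (Otis score 42 corresponds to Otis I.Q. 100).
    """
    return 100 + (score - age_group_mean)


# Top and bottom rows of Table 15: (Otis score, Stanford-Binet M.A.)
for score, ma in [(72, "19-3"), (12, "10-0")]:
    print(score, otis_deviation_iq(score), round(ratio_iq(ma), 1))
```

Under these assumptions the two kinds of "I.Q." part company at
both ends of the scale, which is the discrepancy the text points to.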


6. Otis Quick-Scoring Mental Ability Tests * 


This is yet another, and in some respects superior, instance of
a departure from Army testing practices along similar lines.
Published at a later date than the test just described, it is
essentially a development of it. The battery has three levels. The
Alpha Test is for grades 1a to 4. The Beta Test is for grades 4 to
9. The Gamma Test is for grades 9 to 16. The battery has four
forms, two of which are machine scorable.

Each test consists of 80 items. The content is of the conventional
kind, consisting of analogies, vocabulary, verbal opposites,
disarranged sentences, reasoning, proverbs, and so forth. The form
of the items, however, is a distinctive feature. All of them are
thrown into five-choice multiple-choice form. This makes for rapid
response on the part of the subject and also facilitates the
scoring.

As noted above, the test can be obtained arranged for machine
scoring. But even without this, scoring is very easy. All the
responses to each test appear on a single sheet, so that a punched
stencil can be laid over it, with no shifting and no turning of
pages.

The norms for the Beta Test are based on about 16,000 cases.
The same method of computing intelligence quotients as that
discussed in connection with the previous battery occurs here.

Distinctive features of this test are its brevity and scorability.
Yet it is definitely related to the much more extensive American
Council on Education Psychological Examination for College
Freshmen. Weber (q.v.) has worked out and published regression
lines which make it possible to read from the score obtained on
either one to the probable corresponding score on the other.

The name of Otis has long been associated with workmanlike

* See Otis (1918) for his ideas on test construction.

Particularly the quick-scoring tests. Also they are easy to
administer, and take very little time in comparison to some others.
The testing time required for these two examples ranges from 20
to 30 minutes. Here a question that naturally suggests itself is
whether a reasonably valid and dependable score can be obtained
so rapidly. There is good evidence that it can. The batteries show
about the same correlations with other intelligence tests as those
ordinarily expected. Moreover, Bingham (q.v.) has reported a
correlation of .793 between the Otis Self-Administering Tests,
Advanced Examination, and the Scholastic Aptitude Test of the
College Entrance Board, which requires three hours of testing time,
or at least six times as much as the former. A full set of
Scholastic Aptitude Test equivalents for Otis scores is shown in
Table 16. Note that the top Otis score is at the 92nd percentile
S.A.T., indicating that the Otis Test is too easy for this group.
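
A table of equivalents such as Table 16 rests on a regression line.
As a rough sketch only, an ordinary least-squares line can be
fitted to a handful of (Otis score, S.A.T. score) pairs read from
legible rows of that table. This illustrates the idea and is not
Bingham’s own computation; the function names are hypothetical.

```python
# (Otis score, S.A.T. score) pairs taken from legible rows of Table 16.
PAIRS = [(74, 638), (72, 619), (69, 592), (66, 565), (62, 529), (57, 483),
         (50, 420), (44, 366), (40, 329), (35, 284), (30, 239), (26, 202)]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


SLOPE, INTERCEPT = fit_line(PAIRS)


def predict_sat(otis_score):
    """Predicted S.A.T. score for a given Otis score on the fitted line."""
    return SLOPE * otis_score + INTERCEPT


print(round(SLOPE, 2), round(predict_sat(75)))
```

On these rows the fitted slope is roughly nine S.A.T. points per
Otis point, and the line reproduces the tabled equivalents closely.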


7. Pintner General Ability Tests: Verbal Series 


This is a sequential battery, built partly from revisions of
earlier tests, which appeared in 1939. It is for three levels: that
for kindergarten to 2nd grade is the older Pintner-Cunningham
Primary Test; that for 5th to 8th grade is a revision of the Pintner
Intelligence Test; that from the 9th to the 12th grade is new. The
battery comes in two forms.

The Intermediate Test consists of 8 subtests as follows (Form ...):
(1) Vocabulary, which gives items calling for choice between
five alternatives to match the meaning of the stimulus words.
(2) Logical selection, to tell what a thing “always has,” e.g.,
forest: snow, trees, beasts, a forester, hunters. (3) Number
sequence, i.e., choosing a number to finish an incomplete series.
(4) Best answer, e.g., five reasons for using a knife, the best to
be chosen. (5) Classification, the items consisting of groups of
five words with one not belonging with the others. (6) Verbal
opposites. (7) Analogies. (8) Arithmetical reasoning.

A highly distinctive feature of this test is that it yields not
only a “global” or over-all score, but a set of profile scores on
the subtests. This exemplifies a major development in test
construction, which has become more and more important in the past
ten years.


TABLE 16

PREDICTION OF SCORES ON SCHOLASTIC APTITUDE TEST (3 HOURS) FROM
OTIS SELF-ADMINISTERING TEST, HIGHER, FORM A (30 MINUTES)

(Quoted from Bingham, Table 41, p. 339)

Otis   S.A.T.  Centile  Letter  ||  Otis   S.A.T.  Centile  Letter
Score  Score   Rank     Grade   ||  Score  Score   Rank     Grade

 75     647     92       B      ||   50     420     21       D
 74     638     91       B      ||   49     411     18       D
 73     628     90       B      ||   48     402     16       D
 72     619     88       B      ||   47     393     14       D
 71     610     86       B      ||   46     384     12       D
 70     601     84       B      ||   45     375     10       D
 69     592     82       B      ||   44     366      8       D
 68     583     79       B      ||   43     356      7       D
 67     574     77       B      ||   42     347      6       E
 66     565     74       B      ||   41     338      5       E
 65     556     7[...]   B      ||   40     329      4       E
 64     547     68       C      ||   39     320      3       E
 63     538     64       C      ||   38     311      2       E
 62     529     61       C      ||   37     302      2       E
 61     520     58       C      ||   36     293     [...]    E
 60     511     54       C      ||   35     284      1       E
 59     502     50       C      ||   34     275      1       E
 58     492     46       C      ||   33     266     .9       E
 57     483     43       C      ||   32     257     .7       E
 56     472     39       C      ||   31     248     .5       E
 55     465     36       C      ||   30     239     .4       E
 54     456     32       C      ||   29     230     .3       E
 53     447     29       D      ||   28     220     [...]    E
 52     438     26       D      ||   27     211     [...]    E
 51     429     24       D      ||   26     202     [...]    E
                                ||   25     193     [...]    E


8. Kuhlmann-Anderson Intelligence Tests *

This very significant test, which has gone through five revisions
since its first appearance, marks the most decisive departure so
far encountered from earlier practice. It has a range of application
from grades 1 to 12. It comes in nine booklets, each containing
10 to 12 subtests, the various booklets being for designated age
and grade levels. Thus each child is always given 10 subtests.
There are, in all, 39 different subtests in the battery, many
recurring at different age levels. These subtests involve the use
of pictures, geometrical figures, mathematics, new associations,
and verbal relations and information. They were selected from a
tentative list of 100 possibilities, on the basis of definite
increase in scores attained at successive age levels. This
criterion is much emphasized by Kuhlmann. A unique feature of the
scoring is that mental ages are obtained from tables of equivalents
for each subtest, and the M.A. of the subject is his median subtest
M.A.

Performance can also be expressed in Heinis’ Mental Units.
Boynton reports that the test correlates as high with the Herring-
Binet as the latter does with the Stanford-Binet, and also that the
test is an excellent one for the identification of unusually bright
pupils. For references see the test manual, which has an extended
discussion of general problems, and also R. G. Anderson.
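
The median-based scoring described for this test is easy to
illustrate. The sketch below assumes ten hypothetical subtest
mental ages, expressed in months, as they might be read from the
tables of equivalents. The figures are invented for illustration
and are not taken from the Kuhlmann-Anderson manual.

```python
from statistics import median


def median_mental_age(subtest_mas_months):
    """Return the median of the subtest mental ages (in months)."""
    return median(subtest_mas_months)


def as_years_months(months):
    """Format a mental age in months as 'years-months'."""
    whole = round(months)
    return f"{whole // 12}-{whole % 12}"


# Hypothetical mental ages (months) on the 10 subtests given to one child.
subtest_mas = [98, 104, 110, 112, 115, 117, 120, 126, 131, 150]
ma = median_mental_age(subtest_mas)
print(ma, as_years_months(ma))
```

One consequence of using the median is that a single unusually high
or low subtest, such as the 150 above, does not pull the final M.A.
the way it would pull an average.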


* Reference: R. G. Anderson; Garrett, 1941.


TESTS FOR HIGH SCHOOL AND COLLEGE LEVELS

A large number of tests have been developed which differ from
the foregoing in being intended for the secondary and college
levels, that is, for more limited age ranges. The distinction is not
deep-going, and does not involve any new principles or practices,
and these tests are considered separately merely for the sake of
clarity of exposition. Later on, when we turn to deal with tests
for young children, and also for adults without reference to
educational placement and status, it will be found that new
problems are indeed encountered.


1. Terman Group Test of Mental Ability


This test, published in 1920, follows the general pattern of
Army Alpha and consists of subtests as follows. (1) Information.
(2) Best answers to questions involving the interpretation of
proverbs and matters of fact. (3) Word meanings and opposites.
(4) Logical selection. (5) Arithmetical problems. (6) Sentence
meaning, each sentence containing a concept the understanding
of which determines the response. (7) Analogies. (8) Scrambled
sentences. (9) Classification. (10) Number series completion. It
requires 35 minutes to administer.

It consists, in all, of 185 items in each of its two forms. These
items were selected from an original list of 886, the criterion being
power to differentiate between persons known to be bright and
persons known to be dull. Percentile norms are presented for each
grade from 7 through 12. They are worked out on a standardization
group of 41,241 White children, with from 4,000 to 10,000 at each
grade level. This large standardization group was drawn chiefly
from city schools, two-thirds of them coming from California. The
result is the development of norms which are probably somewhat
high. A table of mental age equivalents for raw scores is included
in the manual.

This is one of the earliest good group intelligence tests for use
at the secondary level. It lacks the characteristic “efficiency”
features which have developed since 1920, when it was published,
but it is easy to administer and easy to score. The use of percentile
grade norms is open to some objection. However, the test still
stands up well. For general classification and for the prediction of
academic success it is a useful instrument. It is highly verbal in
content, and its relationship to success in trade schools and
industrial schools is not so clear. Moreover, it has some weakness
in its power to discriminate between the average and the superior.
For instance, subtest 2 (best answers) is so easy that of 1,146
college men and 628 college women 45% and 35%, respectively, made
perfect scores (Boynton).


2. Terman-McNemar Test of Mental Ability * 


This test, published in 1941, is a development of the Terman
Group Test of Mental Ability. It consists of seven subtests,
namely, information, synonyms, logical selection, classification,
analogies, opposites, and best answer. In constructing it, the
Terman Group Test was used as a reservoir of items, and enough
more were added for three experimental tests of the same length
as the present one. The three experimental forms were given to
groups of 7th, 9th, and 11th grade pupils, 400 usable cases being
obtained in all. Item difficulties were computed for each grade.
Nondiscriminating items were eliminated. Item validities were
computed. The test was carefully equated to the Terman Group
Test. It is available in two forms, C and D, so designated to be in
sequence with the two forms of the Terman Group Test. Split-half
reliability for grades 7 to 9 is .96. The standard deviation of the
raw scores is 25.69. A table of mental age equivalents for raw
scores is provided. The manual states that I.Q.’s may be computed
in the usual manner, i.e., by dividing mental by chronological age.
It recommends the use of a modified chronological age beyond 13,
with one month of age dropped for each 3 months of life. The user
is cautioned that M.A.’s beyond 16 are scores, and not true
mental ages. Hand- and machine-scoring procedures are provided.

* Reference: Tyler.
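
The modified chronological age lends itself to a short worked
sketch. It assumes that the adjustment begins at a chronological
age of 13 years (this threshold is an assumption here) and that
"one month dropped for each 3 months" means one month is discarded
for every three full months lived beyond the threshold. The
function names are hypothetical and the manual’s own tables may
differ in detail.

```python
def adjusted_ca_months(ca_months: int, threshold_months: int = 13 * 12) -> int:
    """Chronological age with one month dropped per 3 months past the threshold."""
    if ca_months <= threshold_months:
        return ca_months
    excess = ca_months - threshold_months
    return ca_months - excess // 3


def ratio_iq(ma_months: int, ca_months: int) -> float:
    """I.Q. in the usual manner: 100 * M.A. / (modified) C.A."""
    return 100 * ma_months / adjusted_ca_months(ca_months)


# A 15-year-old (180 months) with a mental age of 14-0 (168 months).
print(adjusted_ca_months(180), round(ratio_iq(168, 180), 1))
```

Without the adjustment the same subject would have a divisor of 180
months rather than 172, and so a somewhat lower quotient.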


3. Thorndike Intelligence Examination for High School 
Graduates 
One practical weakness and limitation of almost all the tests
discussed so far (Army Alpha, the advanced sections of the Otis
batteries, the Terman Group Test) is that they are too easy to be
effective with the abler high school graduates and college students.
When such tests are applied to groups of this kind, a great many
subjects reach the “ceiling,” so that discrimination is not adequate.
The Thorndike Intelligence Examination was one of the first
group tests designed to meet this problem.

It is a lengthy and difficult test, requiring two hours and fifty
minutes of actual testing time. It comes in four booklets. The
first contains practice material, consisting of samples of all types
of items to be used later, for which 15 minutes is allowed. The
second booklet contains subtests involving directions, arithmetical
problems, information, opposites, word meanings, and so forth,
and has a time limit of 45 minutes. The third booklet contains
subtests involving sentence completion, algebra problems, and
information, and has a time limit of 50 minutes. The fourth booklet
consists of questions calling for the interpretation of difficult
prose passages, and has a time limit of one hour.

Thorndike (1920, 1921 b) and Wood (q.v.) describe and discuss at
considerable length the use of this test as part of the admissions
requirements at Columbia University. It was found valuable by
the administrative authorities, and in conjunction with other
criteria predicted college success approximately to a correlation
in the order of .60.


4. American Council on Education Psychological Examination
for College Freshmen *


This test has largely superseded the preceding, though the latter 
still finds some uses. 

Quite apart from the outstanding competence of the work of
construction, one of the great values of the American Council test
is that it is revised yearly and has been since 1924, and that each
year large amounts of data from the several hundred institutions
where it is given are assembled, collated, and reported. This means
that it embodies what in the opinion of its authors are the best
practices currently available for tests of its general character, and
that obtained scores can be interpreted along various lines with
much confidence.

An outline of the 1946 edition of the test is presented in Figure
15. It comes in three forms, one for hand scoring, another for
machine scoring, and another with a separate answer sheet for
machine scoring. As will be seen from Figure 15, it consists of
subtests divided into two parts. It yields two chief subscores, a
“Q” score based on responses to subtests requiring quantitative
insight, and an “L” score based on responses to subtests requiring
language responses. These two subscores, together with the total
score on the entire test, are recommended for purposes of student
guidance. The scores on the 6 separate subtests should not be so
used. Here again we have an instance of the profile scoring so
important at the present day. Thurstone and Thurstone are very
emphatic in their warning that the test does not yield either mental
ages or intelligence quotients. They point out that scores of this
kind are meaningful only within a certain age range, and that they
do not apply properly to college students. It is interesting that
such a caution should be thought necessary for the presumably
well-instructed persons likely to use and interpret this test. The
test embodies a great deal of practice material, which takes about
one-third of the total time. This feature has been adversely
criticized.


QUANTITATIVE TESTS
(the Q score)

1. Arithmetic
2. Number Series
3. Figure Analysis

LINGUISTIC TESTS
(the L score)

4. Same-Opposite
5. Language Completion
6. Verbal Analogies

FIG. 15. OUTLINE OF AMERICAN COUNCIL ON EDUCATION PSYCHOLOGICAL
EXAMINATION FOR COLLEGE FRESHMEN, 1946 EDITION

(Thurstone and Thurstone, 1947)


* Reference: Thurstone and Thurstone, 1945, 1947.

Beginning with 1940 an analysis of item difficulty has been set
up, so that the gross scores on successive editions of the test since
then are comparable. Among the many interesting items contained
in the data presented by Thurstone and Thurstone in their 1947
report, it appears that mean scores for all institutions using the
1944 edition range from 128.44 for the highest to 33.42 for the
lowest. Both are four-year colleges. This fantastic range of average
student ability is obviously full of formidable implications for
American higher education.

5. American Council on Education Psychological Examination 
for High School Students * 


This is another test put out at less regular intervals than the
foregoing, by the American Council on Education. It resembles
the test for college freshmen in general, but it is easier. The 1937
edition consists of 4 subtests: (1) Completion, consisting of 55
items and taking 14 minutes; (2) Arithmetical problems, 20 items,
20 minutes; (3) Analogies, 29 items, 19 minutes; (4) Opposites,
54 items, 6 minutes.

* Reference: Thurstone and Thurstone, 1937.


6. Ohio State University Psychological Test


Here is another test which has appeared in a series of editions
over a period of years, and regarding which interpretive data have
been systematically accumulated. A brief synopsis of the 1943-44
edition appears in Figure 16. The test is very well known and
widely used. Hartson (q.v.) presents some analysis of its
reliability and validity based on data obtained from its use at
Oberlin College. He finds that its correlation with freshman
scholarship ranges from .429 to .696.


1. VERBAL OPPOSITES
Thirty stimulus words followed by five other words each, task
being to indicate the opposite of the stimulus word in each case.

2. VERBAL AGREEMENT
Sixty items, each consisting first of two words which establish a
relationship (e.g., both being plurals, noun-adjective
relationship, etc.), then a third word, followed by five words,
task being to indicate which of the five is related to the third
as the first is to the second.

3. PARAGRAPH READING
Nine paragraphs, with four to ten questions on each. Literary,
scientific, and mathematical material is used.

FIG. 16. OUTLINE OF THE OHIO STATE UNIVERSITY PSYCHOLOGICAL
TEST, REVISION 22, 1943-44


One or two comments should be made about tests designed for
secondary and college levels, among the best of which the above
six are representative samples.

First, their special orientation obviously makes the problem of
validation more manageable than it would otherwise be. At the
same time, one should be on one’s guard against regarding them
simply as special aptitude tests. The authors of the American
Council Test specifically maintain that it reveals certain general
psychological factors; to wit, quantitative ability and language
ability. And it can be argued that the others reveal general
intelligence in a special setting and manifesting itself in special
groups. If this is the case, they cannot be regarded as aptitude
tests pure and simple.

Another noteworthy consideration is that such tests are being
developed by nonprofit organizations, such as the American Council
on Education and the Ohio College Association Committee on
Intelligence Tests for College Entrance. This makes possible the
publication of successive improved editions, the systematic
accumulation of interpretive data over long periods of time, and
the [...] in the instruments of statistical procedures and
psychological [...] irrespective of marketability, all of which
are [...] values indeed. The specific repudiation by Thurstone and
Thurstone of interpretations in terms of mental ages and
intelligence quotients, which are certainly misleading for the
[...] concerned in spite of their popular appeal, is most [...].
Also, their construction of a test which yields two [...] instead
of relying entirely on the usual global [...] score is a
development of moment. It is at least a step, whether a successful
one or not so far, towards the building of psychometric instruments
capable of genuine psychological analysis. Such developments are
what one might expect from nonprofit organizations which can commit
themselves to the [...] of the best possible tests. Unfortunately
this hope is not always realized, for at least one of the major
endowed university presses advertises its tests in the spirit of
the vendor of patent medicines.


PERFORMANCE TESTS AND SCALES 


These are tests in which the tasks set up require the subject to
“do something” rather than to make a verbal response, e.g., to solve
a maze, to assemble a pattern of blocks, to fit cutouts into the
appropriate holes in a form board, to assemble and put together
[...] presented part-wise on pieces of wood or card, perhaps to
select a proper implement for an indicated task, and so on.
Performance items and subtests appeared in the work of Binet from
the very first, and they appear in all the scales discussed in the
preceding chapter, except the CAVD. The Army work, and more
particularly the Army Beta Test, gave a special impulsion to this
development and led to the construction of numerous group tests
and scales consisting entirely of performance items.

The purpose is to get away from the verbalism of such intelligence
tests as have been described, and to avoid its working
limitation of applicability to those who can use English readily.
In many of them language is used to a considerable extent, but
some are wholly nonlanguage, instructions being conveyed by
demonstration. A performance test often includes items very like
those in a test of mechanical or manual aptitude, but its purpose
is different. Its aim is to measure general mentality in and through
manipulation, rather than ability to manipulate in and of itself.

Five performance scales are presented below in chronological
order, and then one test of a special and unusual kind is described,
to which a good deal of attention has been given.


1. Pintner-Paterson Scale of Performance Tests 


This is a pioneer performance scale, published in 1917, and 
abbreviated and revised in 1937. It is for individual administra- 
tion and consists of 15 subtests as follows. 

(1) Mare and Foal Test. This consists of a picture of a farmyard
showing among other things a mare and foal, with 11 pieces
cut out. The task is to assemble the pieces into a picture. It is
scored by time taken in seconds up to 5 minutes and by the number
of errors. The same scoring is used in the next 10 subtests.
(2) Seguin Form Board. This is a board 20 x 14¾ inches from
which 10 geometrical shapes have been cut out. These are to be
fitted into the proper apertures in the board. (3) Five-Figure
Form Board. This is similar to the Seguin Form Board but more
difficult, because 11 pieces must be fitted into 5 apertures.
(4) Two-Figure Form Board. Similar to the above but easier.
(5) Casuist Form Board, considerably harder than the above two,
requiring 12 pieces to be fitted into 4 holes. (6, 7, 8) Somewhat
similar form boards, with varying numbers of apertures and pieces.
(9) Manikin Test. This consists of a doll in 6 pieces. The arms,
legs, etc., are to be fitted into place, but the holes in the body
for fitting together the various parts differ in shape.
(10) Feature Profile Test. This consists of 8 pieces out of which
to form the profile of features. (11) Ship Test. This consists of
a ship picture in 10 rectangular pieces to be fitted together.
(12) Picture Completion Test. This consists of a picture of a
rural scene or scenes with 10 squares cut out, the task being to
fill in by selecting the most suitable pictures out of several.
The score originally was the number of blanks correctly filled out
in 10 minutes, but 5 minutes was found usually to be ample time,
so the test was restandardized for 5 minutes. (13) Substitution
Test. This consists of rows of geometrical figures to be marked
with numbers according to a given key. The score is the time
required to mark 50. (14) Adaptation Board. A board with 4 round
holes, three of them 6.8 cm. in diameter, and the other 7.0 cm.
The subject is shown how one block fits exactly into the largest
hole, and then told to put it into the right hole, the board being
placed in four positions. The score is the number of right tries.
(15) Cube Imitation Test. Consists of 5 black 1-inch cubes. Four
of them are placed in a row about 2 inches apart in front of the
subject. The examiner taps the four with the fifth cube in various
orders at a rate of about 1 per second. The task of the subject is
to watch and then imitate. The score is the number of correct
tries.

The scale yields a point score. Percentiles are worked out for 
each level. Tables make possible the computation of mental age 
for each subtest. It is suggested that the median mental age on 
all subtests be considered as a single representative mental age. 

Most of these items are in the nature of stock material for per- 
formance scales and tests, and have been used again and again 
either in the identical form here described or with minor vari- 
ations. An appraisal of them therefore carries far beyond this 
Particular instrument. 

(a) It is noteworthy that speed is an important factor in 12 out
of the 15 subtests. This raises some question of their power to
discriminate valid intellectual responses or responses calling for
general intelligence. Moreover, very slight differences in speed
affect the score and may determine the difference between age
level classifications. (b) Throughout the scale manipulative
dexterity and the control of small movement are involved. That this
opens the same question once again seems clear. (c) It is quite
true that a systematic method of working will tend to lower the
time taken and to increase speed. Here perhaps is the best reason
for considering these items as significant signs of general
intelligence. (d) In the Substitution Test and the Cube Test (13
and 15) immediate memory span is involved. E. B. Greene (q.v.),
from whom these evaluative comments are derived, points out
that they are based simply on inspection and conjecture, and not
on statistical analysis.

Some further light is thrown upon the meaning and general
validity of this instrument by an investigation reported by
Mc-ray (q.v.). Fifty children at or above 130 I.Q. on the Stanford-
Binet scale and 50 children with I.Q.’s from 75 to 90 were compared
on the Pintner-Paterson scale. The resulting mental ages were [...]
for the bright group and spuriously low for the dull, [...]
adaptation and achievement were markedly wrong. It is to be noted,
however, that the subjects in this study were young and that they
consisted of children who deviated markedly from the average. The
scale is considered a valuable supplement to highly verbal scales,
but it is not a good substitute for them. It is particularly useful
for the deaf, for those who do not understand English, and for
certain types of emotionally disturbed subjects. It is not very
satisfactory for older children, and older children who are dull
often achieve high ratings which are probably spurious because of
their manipulative facility (A. W. Brown, 1941; Freeman, 1939).


2. Point Scale of Performance Tests (Arthur) * 


This is also an instrument for individual administration,
intended for ages from 6 years upwards. It is closely similar to the
foregoing. In fact, Arthur restandardized all but three of the
subtests of the Pintner-Paterson scale, the omissions being numbers
8, 13, and 14. For an account of the processes of
restandardization, see Arthur, 1933; and for interpretations of the
scale and its results, see Arthur, 1930.

Arthur introduced into her scale two subtests which do not
appear in the instrument discussed above. They are the Porteus
Maze Test (v. Porteus, 1915, 1924), and the Kohs Block Design
Test (v. Kohs). The Porteus test consists of 11 mazes of increasing
difficulty. They are to be traced in pencil, which must not
cross any line. If a line is crossed, the maze is withdrawn and a
duplicate is given for another attempt. For the simpler mazes two
trials are allowed, but for those intended for the 12- and
15-year-old levels four trials are allowed. A success obtained on
a trial later than the first counts less on the score than a
success on the first trial with any maze. The test yields a total
credit in terms of mental age. There is no time limit. Speed is not
a factor. Porteus believes the task to be indicative of prudent and
careful foresight and choice. The Kohs Block Design Test consists
of designs presented on 17 cards to be duplicated with colored cube
blocks. All the cube blocks are identical, with four of their sides
colored blue, yellow, red, and white, one side divided horizontally
between blue and yellow, and one side divided horizontally between
red and white. From 6 to 16 blocks are needed to duplicate
the designs as they increase in complexity. The test is scored by
speed. Both these subtests are very familiar and widely used. They
appear in many performance batteries, and they have been adapted
and re-edited in many ways.

The Arthur scale is set up in two forms identical in difficulty.
Form 1 was standardized on 1,125 children ranging from 5 to [...]
years old. Mental age norms were worked out with 574 of these
children for whom I.Q.’s had been obtained either on the Stanford-
Binet scale or the Kuhlmann-Binet scale. The criterion used for
the selection and inclusion of subtests was their power to
discriminate between the bright and the dull. Those subtests
discriminating most markedly were given the most weight in
computing the total score. Arthur reports that the Porteus Maze
Test and the Cube Test from the Pintner-Paterson scale showed the
best capacity for discrimination.

The general evaluative comments on this scale are substantially
the same as those on the Pintner-Paterson scale. Wallin (1946), in
one of the few studies comparing this test with the Stanford-
Binet, reports a correlation of .72 with the 1916 Revision, using
290 cases, and one of .53 with the 1937 Revision, using 172 cases.
This is an unusually high relationship between tests of the type
involved.

* References: Arthur, 1930, 1933; Hilden and Skeels.


3. Cornell-Coxe Performance Ability Scale * 


This scale consists of 6 subtests of performance type, with a
seventh as an optional substitute for the third. They are as follows:
(1) Manikin Profile Test, (2) Kohs Block Design Test, (3) Picture
Arrangement Test, (4) Digit-Symbol Test, (5) Memory for
Designs Test, (6) Cube Construction Test, (7) Picture Completion
Test.

The instrument has been standardized on 306 cases extending
from kindergarten through the eighth grade. This seems a meager
sampling for such an extended range. The authors adopt a curious
and, so far as the present writer is aware, a unique way of
determining mental ages. For them a mental age is neither the median
of the scores of a given age group nor the median of the ages of
those making a given score. Rather, it is a somewhat arbitrarily
determined median between these two values, which makes it
decidedly questionable and ambiguous. In its present form the
scale is at best a possible supplement to other performance scales,
and might be used as a supplement to the more general scales
discussed earlier in this chapter.

The question which evidently arises in connection with
performance scales is whether and to what extent they are valid as
measures of general intelligence. The evidence on this point will
be presented later in another connection. For the moment it may
be said that they show only medium to low correlations with the
customary criteria, such as results on standard intelligence tests,
school achievement and the like (v. Gaw). As instruments for
independent use their utility is limited. They often have clinical
value, however, and may indicate certain types of mental and
emotional disturbance. Also, they are often useful supplements
for standard instruments for the measurement of intelligence. The
general view is that at least they embody a significantly
different conception of intelligence from that represented in the
standard instruments previously considered and those to be
discussed later. In view of their emphasis upon speed, manipulative
dexterity, neural control, visual memory, and the like, all these
conclusions, which are supported by considerable statistical
evidence, seem reasonable enough.

* Reference: Cornell and Coxe.


4. Chicago Non-Verbal Examination 


The test comes in one form, and is for ages 7 to adult. It contains
10 subtests as follows. (1) Digit-symbol, easy learning.
(2) Indicating incongruous objects in a pictorial representation
of a series of objects. (3) Counting the numbers of cubes in pictured
piles. (4) Duplicating a given shape by selecting appropriate
segments pictorially shown. (5) Selecting from a series of designs
one like a given design. (6) Arranging parts of a picture to make
it complete. (7) Arranging a pictured series of events to show a
sequence in time, e.g., series of pictures of catching and losing a
fish. (8) Showing the thing wrong in a series of pictures. (9) Selecting
from a set of pictures the one that goes with a given picture.
(10) Digit-symbol, difficult learning.

Norms were established on 1,844 hearing children. Mental age,
percentile, and standard score norms up to the age of 14 are given.
Beyond that level there are no M.A.’s. Reliabilities of .80 and
higher are reported for groups ranging through distributions of two
or more years C.A. and from 2 to 6 grades. These are probably not
high enough for intragrade or intra-year comparisons. Four validity
criteria were adopted: correlation with chronological age, discrimination
of normal from feeble-minded children, normality of
the distribution of scores, and correlation with other tests. On these
criteria the validity is reported as “reasonably good.” The test
purports to yield a global measurement of “nonverbal aspects of
intelligence.” One point to be noted is that the pictorial material,
so important in this instrument, is often very badly reproduced.


5. Pintner General Ability Tests: Non-Language Series


This battery parallels the Pintner General Ability Tests: Verbal
Series. Its general layout is similar. It assimilates, with considerable
modification, the earlier Pintner Non-Language Mental Tests,
which was, up to the time of its publication, one of the most comprehensive
and excellent of nonlanguage tests. A synoptic outline
of the Intermediate Test is shown in Figure 17.

Pintner’s (1924) discussion of the earlier test probably applies 
quite well here also. Correlations with verbal tests are not high, 
running from .25 to .72. Correlation with a criterion of intelligence


1. FIGURE DRAWING
A geometrical figure is shown complete and then cut up. Task
is to choose the one of 5 lines that would cut the complete figure
as shown.

2. REVERSE DRAWINGS
A series of items each consisting of a geometrical figure shown
complete and then reversed and with one line missing. Task is
to choose the one of 4 lines which is the one missing.

3. PATTERN SYNTHESIS
Items consisting of 2 geometrical figures identical in outline but
with different internal segments shaded. Task is to imagine the
first superimposed on the second and to tell which of 4 designs
it would then resemble.

4. MOVEMENT SEQUENCES
Items each consisting of three figures with movement in given
direction indicated by arrows. Task is to tell where the figure
would be if the movement continued beyond the point shown
in the third figure by choosing from 4 diagrams.

5. MANIKIN
Manikin figure shown in various positions and postures. The
position of the arms in the first figure in each item to be matched
from 4 others showing manikin upside down, etc.

6. PAPER FOLDING
Drawings of sheets of paper folded in various ways with small
segments cut out. Task is to tell what each sheet would look like
unfolded by choosing among 4 drawings.


FIG. 17. PINTNER GENERAL ABILITY TESTS: NON-LANGUAGE SERIES;
INTERMEDIATE LEVEL.
SYNOPTIC OUTLINE

constituted by a composite of chronological age, school marks,
teachers’ estimates, school progress, and four other intelligence
tests was .78 for 235 children in grades 2 to 4. Correlation with
teachers’ estimates alone was about 0.


6. Drawing a Man * 


This test, intended for ages from 3½ to 13 years, has attracted
a good deal of attention because of its novelty. Goodenough concluded
from her own work and from the investigations of others
that drawing can be an indication of intelligence. This is the idea
embodied in the present test. Instructions to the subject are as
follows: “On these papers I want you to make a picture of a man.
Make the very best picture that you can. Take your time and
work very carefully. I want to see whether the boys and girls in
____ school can do as well as those in other schools. Try very
hard and see what good pictures you can make.” The scoring
depends on the presence of certain items, such as legs, attached
legs, nose, fingers, etc., and not on art quality. The figure of a
man was chosen because of its familiarity. Proportion and perspective
as well as enumerated parts are credited in the scoring.
The points emphasized in scoring were chosen because they show
a regular increase with age, and differentiate between children
of the same age but in different school grades. The total possible
score is 51. Instances of means are as follows: for C.A. 3½, score
of 2; for C.A. 4½, score of 6; for C.A. 5½, score of 10; for C.A.
13½, score of 42. Obtained reliabilities run from .77 to .93, but
the scoring seems decidedly subjective. McCarthy (q.v.), in a study
of the reliability of this test, gave it twice at a one week interval
to 386 3rd and 4th grade children. Each test was scored three
times, twice by the same scorer, and once by another. The correlation
of scorings by the same person was .94, and 12.4% of
cases yielded a difference of one year or more M.A. Scorings by
different persons correlated .90, but 25.3% of cases differed by a
year or more. An odd-even reliability of .89 was obtained. In general
it does not correlate highly with other intelligence tests. In
one study (McHugh, 1945) correlations of .45 ± .06 on M.A.’s and
.41 ± .06 for I.Q.’s were obtained with the Stanford-Binet.
These coefficients are higher than others reported. There is some
reason to believe that it is definitely affected by environmental
influences, for Indian children, particularly boys, from 6 years up
proved decidedly superior to whites, which is thought to be due
to emphasis on visual values in the upbringing of Indian children
(Havighurst, Gunther, and Pratt).

* Reference: Goodenough, 1926.

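
The four mean scores quoted for this test appear to lie on a straight line if the printed age fractions are read as half years, which is an assumption about the garbled print. Each additional point of score then corresponds to three months of age. A minimal Python sketch of that reading follows; the function names are hypothetical, and the linear rule is only a summary of the four quoted pairs, not the published norm table.

```python
# Mean scores quoted in the text, reading each age fraction as a half year
# (an assumption about the print): (chronological age, mean score).
QUOTED_MEANS = [(3.5, 2), (4.5, 6), (5.5, 10), (13.5, 42)]

def approx_age_for_score(score):
    """Approximate age equivalent in years: 3 years plus 3 months per point."""
    return 3.0 + score / 4.0

def approx_iq(score, chronological_age):
    """Ratio I.Q. from the approximate age equivalent (100 * M.A. / C.A.)."""
    return 100.0 * approx_age_for_score(score) / chronological_age

for age, score in QUOTED_MEANS:
    # Every quoted pair satisfies the linear rule exactly.
    assert approx_age_for_score(score) == age

print(approx_age_for_score(20))   # 8.0
print(round(approx_iq(20, 7.5)))  # 107
```

The sketch should not be extended beyond the range of the quoted means, since nothing in the text says how the norms behave outside it.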
From the examples here presented, which, although few in
number, are representative of the best work of its kind, the attempt


Beta, and to build group intelligence tests of performance type
is much less happy and successful than in the case of group intelligence
tests. The items are often forced and trivial and of dubious


instance, the imaginative fitting together of separate geometrical
figures is not apt to occur in real life. Digit-symbol substitutions,
which are frequently used, may be suitable for code workers, but
have little relevance to the doings of most people. Also, it rarely
happens that one is called upon to detect incongruities in pictured
scenes. It may well be, as Porteus (1924) argues, that maze tests,
which are also used, though they do not occur in the above examples,
can indicate continuous adaptation and planning and
self-criticism which are among the recognized attributes of intelligence.
But as Porteus also remarks, these values are nullified
when, as often happens, maze tests are run under a time limit.
So the tests of the kind under consideration are mostly far afield
from the descriptions of intelligence advanced by Stoddard and
Boynton; and since item difficulty is not much considered in their
organization, they are also unrelated to Thorndike’s description,
with its emphasis upon altitude. Certainly they do not coincide
closely with the two former descriptions, which make much of
the importance and relevance of motivation, and of the intrinsic
importance of the tasks intended to reveal intelligence. As a
secondary but not unimportant point, it must be said that the
pictorial material so largely used is often of atrocious quality, so
that one cannot really tell what is represented, and all sorts of
preposterous interpretations are suggested. Moreover, they are
“performance” tests only by courtesy, for there are few paper-and-pencil
tasks which are really direct “doings.” Making decisions
about pictures of piles of blocks is not a parallel for
actually handling the blocks, for instance.


SUGGESTED ADDITIONAL READINGS


For additional reading and more intensive study of the material in
this chapter the most important sources are the tests and more particularly
the manuals of the tests discussed. Publishers will be found
listed in the bibliography of tests at the end of the book. Also the
references mentioned in the text in connection with the various tests
may be consulted. Further readings are suggested as follows:

Raymond B. Cattell and William Moodie, A guide to mental testing
(London: University of London Press, Ltd., 1936), Chapter 7,
“Notes on the selection of tests, interpretation of results, and synthesis
of evidence.” A practically valuable general chapter.

Paul L. Boynton, Intelligence, its manifestations and measurement
(New York: D. Appleton-Century Company, 1933), pp. 210 ff.,
“Criticism of group tests.” Raises many general critical issues.

Edward B. Greene, Measurements of human behavior (New York:
The Odyssey Press, 1941), Chapter 12, “Performance, mechanical and
motor tests.” Descriptions and discussions of many tests.

Ethel L. Cornell and Warren W. Coxe, A performance ability scale
(Yonkers-on-Hudson, N. Y.: World Book Co., 1934). Valuable general
discussions as well as an account of the scale.

Oscar Krisen Buros (editor), The 1938 mental measurements yearbook
(New Brunswick, N. J.: Rutgers University Press, 1938);
also The 1940 mental measurements yearbook (Highland Park, N. J.:
The Mental Measurements Yearbook, 1941). These two reference
works contain a mine of information and comment in regard to many
tests.


QUESTIONS FOR DISCUSSION


1. To what extent does the fact that a subtest or item shows
increasing scores or percents passing with age establish its validity
for intelligence?

2. Bring together all the evidence on validity presented in this
chapter and consider just how much it proves.

3. What explanation of the low mean mental age of the native
white draft in the First World War seems reasonable?

4. Consider some important differences that a given global test
score obtained by several individuals might conceal.

5. Among the various items in the “performance type” tests discussed,
which seem to you true performance items? Can a paper-and-pencil
item ever be a true performance item?

6. What different conception of intelligence might be embodied in
verbal and in performance type tests? Would the difference be a
matter of the author’s theory, or have to do with the sort of items he
chose?

7. What reasons would lead you to expect fairly high correlations
among verbal group intelligence tests?

8. If differences between the results of such tests are largely a
matter of the standardization groups, can you see any way of avoiding
them?

9. If someone asked you to recommend an intelligence test for
use in a school, in personnel work, or elsewhere, what considerations
would you have in mind in choosing one?

10. In what respects and for what reasons might one expect better
tests from nonprofit organizations than from commercial publishers?


CHAPTER VI 
TESTS OF INTELLIGENCE (II) 
INTRODUCTION 


As we approach the topics of tests for young children, and for
adults in general without reference to educational status or intention,
it may be well to remark that most mental tests have been
designed for school ages. This is particularly true of group tests,
for the application of instruments of measurement to the very
young virtually requires individual administration. Some of the
more recent scales, however, extend well below the school ages.
The Revised Stanford-Binet scale reaches the age of 2 years. The
Kuhlmann Tests of Mental Development go to the age of 3
months. The California Tests of Mental Maturity have a section
for kindergarten children. On the other hand, the Revised Stanford-Binet
scale has four test groups for adult levels, and the
Wechsler-Bellevue scale extends upwards to the age of 60. Since
these instruments have already been discussed under other headings,
they will not be mentioned again here, except incidentally.
We shall deal with tests specifically designed for young children,
and specifically designed for adults. Tests developed in World
War II are of particular importance in the latter category, although
there have also been a few civilian tests of this kind.


1. Minnesota Preschool Scale 


The Minnesota Preschool Scale is designed for preschool ages up
to 6 years. It comes in two forms. A synoptic outline of the scale
is shown in Figure 18.

As will be seen, it consists of 26 items, generally comparable
to those contained in the Revised Stanford-Binet scale for comparable
age levels. It was standardized on 900 children of varied
social and economic status, care being taken to avoid too high a
percentage of privileged children in the group. A feature of the
scale is that it yields C scores, which have not been widely used
elsewhere. A C score is one which expresses the difficulty of items
that can be managed with a correctness of 50% (Kelley).
The difficulty steps are rated in terms of the standard deviation


1. Pointing out parts of the body
Showing parts of the body on doll.

2. Pointing out objects in pictures
Show a chair, etc.

3. Naming familiar objects
Five actual objects presented for naming.

4. Copying drawings
Circle, triangle, diamond.

5. Imitative drawing
Experimenter draws lines, designs; child imitates.

6. Block building
Twelve cubes to be built into various designs, copying experimenter.

7. Response to pictures
Three pictures, task to tell what they are about.

8. Knox cube imitation
Four cubes nailed to base, one loose. Child to imitate various
manipulations.

9. Obeying simple commands
Handling and manipulating various objects on instruction.

10. Comprehension
Telling what to do in various simple situations.

11. Discrimination of forms
Match form of actual objects from cards with pictured designs.

12. Naming objects from memory
A set of objects is shown and then covered. One is taken away,
and the set is shown again, child being asked to tell which is
gone.

13. Recognition of forms
A picture is shown briefly, then removed, and the child is asked
to match it from a card offering various choices.

14. Colors
Naming colors as presented.

15. Tracing a form
Following forms with a pencil.

16. Picture Puzzles: Rectangular Series
Pictures dismembered in rectangular directions to be reassembled.

17. Incomplete pictures
Indicating omissions from pictures of simple objects.

18. Digit span
Repeating series of digits given orally.

19. Picture Puzzles: Diagonal Series
Like 16 above, but diagonally disassembled.

20. Paper folding
Experimenter folds a sheet, child to do the same.

21. Absurdities
Absurd sentences presented orally.

22. Mutilated pictures
Pictures with “something wrong” in them.

23. Vocabulary
List of words to be explained.

24. Giving word opposites
List of words, task being to give the opposite of each.

25. Imitating position of clock hands
Cardboard clock, child to imitate hands variously placed by
holding out arms.

26. Speech
General record of any sentence of five words or more spoken
by the child is credited on his score.


FIG. 18. MINNESOTA PRESCHOOL SCALE.
SYNOPTIC OUTLINE


of the score on the assumption of normal distribution, the purport
being to obtain scoring units which express equal increments of
difficulty at all points in the scale. Also, the scores can be converted
into intelligence quotients and mental ages. The scale yields
both verbal and nonverbal scores separately, but this is not recommended
in the case of children below the age of 3 years. The
reference is the manual, but also see DeForest.
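
The general idea of rating difficulty in standard deviation units under an assumed normal distribution can be illustrated with a short Python sketch. It shows only the principle of turning a percentage passing into a normal deviate. It is not Kelley's procedure nor the scale's own conversion table, and the percentages used are invented.

```python
from statistics import NormalDist

def difficulty_in_sd_units(proportion_passing):
    """Normal deviate above which `proportion_passing` of a normal group falls.

    An item passed by exactly half the group sits at 0; harder items
    (smaller proportions passing) get larger positive values.
    """
    if not 0.0 < proportion_passing < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(1.0 - proportion_passing)

# Hypothetical proportions passing three items in one reference group.
for p in (0.84, 0.50, 0.16):
    print(p, round(difficulty_in_sd_units(p), 2))
```

On this scale an item passed by 16% of the group lies about one standard deviation above an item passed by 50%, which is the sense in which such units are meant to express equal increments of difficulty.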


2. Merrill-Palmer Scale of Mental Tests †


This is another widely used set of tests for the measurement of
intelligence in young children. It is designed for ages up to
63 months, and comes in one form. A partial synopsis of the scale
is presented in Figure 19.

The scale was worked out in connection with the Merrill-Palmer
School in Detroit. It consists of 38 subtests which were selected
from a tentative list of 79. Item selection and standardization
were based upon a group of 631 children from varied environments,
care being taken to avoid a preponderance of those enrolled
in or on the waiting list of private schools.

† Reference: Stutsman.

The subtests were selected on the following criteria. (a) Attractiveness
to children. (b) Sufficient variety to sample a wide range
of abilities. (c) Significant and marked differentiation of difficulty
with age, so that 4 months’ difference in mean age was readily


EIGHTEEN TO TWENTY-FOUR MONTHS

2.* Throwing Ball
Child is given tennis ball and told to throw it to person
giving test.

3.* Straight Tower
Child is told to copy experimenter who builds tower out of
scattered blocks.

9. Repetition of Words
Experimenter asks child to say “kittie,” then presents three
other words all at same time.

11.* Folding Paper
Child watches while experimenter folds sheet double and
opens it out, then asked to do the same, i.e. to “make a
little book.”

TWENTY-FOUR TO TWENTY-NINE MONTHS

13.* Identification of Self in Mirror
Child shown own reflection in mirror and asked to tell by
name who it is.

17.* Drawing up String
Child watches while experimenter pulls up a stick by a string
tied to it, then asked to do the same.

19.* Questions
Child is asked ten very simple questions.

THIRTY TO THIRTY-FIVE MONTHS

23.* Matching Colors
Putting capsules colored red, blue, green, yellow into boxes
of same color.

31. Seguin Form Board
Fitting the inserts into the ten spaces of the Seguin Form
Board. Should take 222 seconds or less.

32.* Repetition of Word Groups
Similar to repetition of words above.

THIRTY-SIX TO FORTY-ONE MONTHS

37.* Copying a Circle
Child shown a card carrying circle 1-inch diameter and asked
to make one like it, using a pencil.

46. Action Agent
Child is asked “What sleeps?”, “What cuts?”, etc., and scores
by indicating either the agent or the object of the activity.

FORTY-TWO TO FORTY-SEVEN MONTHS

49. Seguin Form Board
As above, but timed to 72 seconds or less.

56.* Copying Cross
Procedure similar to copying circle above.

57. Mare and Foal
The various pieces of the puzzle completion board to be
fitted together to make the picture of the mare and foal
which is presented.

FORTY-EIGHT TO FIFTY-THREE MONTHS

60. Seguin Form Board
As above, but timed to 63 seconds.

67. Mare and Foal
As above, but faster timing.

69. Four Buttons
Buttoning four buttons attached to strips of cloth into buttonholes.

FIFTY-FOUR TO FIFTY-NINE MONTHS

82. Copying Star
Similar to circle and cross above.

SIXTY TO SIXTY-FIVE MONTHS
Tests here are similar to above, but with higher norms.

SIXTY-SIX TO SEVENTY-ONE MONTHS
Tests here are similar to above, but with higher norms.


* Numerals are the serial numbers of the subtests in the scale. Asterisks
indicate tests which do not appear at higher levels.


FIG. 19. SAMPLE SUBTESTS FROM THE MERRILL-PALMER SCALE
(Stutsman)

indicated. (d) Avoidance so far as possible of influence by training
and environment. (e) Attainment of an equal spacing of steps
or units of difficulty. (f) Universality of experience. (g) Ease of
administration. The subtests were located in the age subdivision
at which 50% of the group were able to pass them.

As will be seen, the scale is divided into age levels at steps of
6 months. If a child passes half or more of the subtests at any
level, he goes on to the next. One point of credit is given for each
subtest passed. Omissions and refusals are credited as successes
when the number ascribed to the subtest in question is below the
child’s total number of successes, and otherwise as failures. Part
of the reason for this is to avoid an undue effect from the negativism
often shown by young children confronted by this or that
item, which is a disturbing and falsifying influence in measurement
at early stages.
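
The crediting rule for omissions and refusals can be written out as a short Python sketch. The function name and the sample record are hypothetical, and the sketch takes "total number of successes" to mean the count of subtests actually passed, applied in a single pass; the manual may apply the rule differently.

```python
def merrill_palmer_raw_score(passed, omitted_or_refused):
    """One point per subtest passed, plus credit for some omissions/refusals.

    `passed` and `omitted_or_refused` are collections of subtest serial
    numbers. An omitted or refused subtest is credited only when its serial
    number is below the child's count of actual successes (one reading of
    the rule stated in the text).
    """
    successes = len(set(passed))
    credited = [n for n in set(omitted_or_refused) if n < successes]
    return successes + len(credited)

# Hypothetical record: seven subtests passed, two refused.
print(merrill_palmer_raw_score(passed=[1, 2, 3, 5, 6, 8, 9],
                               omitted_or_refused=[4, 30]))  # 8
```

In this invented record the refusal of subtest 4 is credited because 4 is below the seven successes, while the refusal of subtest 30 counts as a failure.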

The scale yields three types of scores as follows. (a) Mental
ages, norms for which are given in conversion tables showing raw
score equivalents. Thus a raw score of 47 indicates a mental age
of 39 months. (b) Standard scores, for which again conversion
tables are given in the manual. Thus a raw score of 47 is at the
mean for 39 months, 2 S.D. above the mean for 31 months, and
2 S.D. below the mean for 55 months. (c) Percentile scores. Thus
47 is at the median or 50th percentile for 39 months, and at the
95th percentile for 32 months. It should be noted that the scale
does not yield intelligence quotients, for a reason to be presented
below.
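
The way one raw score takes on different standard scores at different ages can be sketched in Python. The norms below are invented solely so that a raw score of 47 reproduces the pattern described for 31, 39, and 55 months; they are not the manual's conversion tables, and the function names are hypothetical.

```python
from statistics import NormalDist

# Hypothetical norms: age in months -> (mean raw score, standard deviation).
# Chosen only so that a raw score of 47 reproduces the pattern in the text.
HYPOTHETICAL_NORMS = {31: (31.0, 8.0), 39: (47.0, 8.0), 55: (63.0, 8.0)}

def standard_score(raw, age_months, norms=HYPOTHETICAL_NORMS):
    """Deviation of `raw` from the age-group mean, in S.D. units."""
    mean, sd = norms[age_months]
    return (raw - mean) / sd

def normal_percentile(z):
    """Percentile rank implied by z if scores were exactly normal."""
    return 100.0 * NormalDist().cdf(z)

for age in (31, 39, 55):
    z = standard_score(47, age)
    print(age, z, round(normal_percentile(z), 1))
```

Percentile tables in a manual are ordinarily empirical, so they need not agree with the values a normal curve would imply for the same standard score.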

Of the subtests, 16 are scored all-or-none, and 22 have variable
score values. The subtests are not massed according to the functions
measured. Thus the instrument conforms to the Binet composite
pattern and is committed to the idea of averaging a number
of undefined abilities.

With regard to validation the following positive evidence is presented.
(a) The scale differentiates well between children rated
on intelligence by the staff of the Merrill-Palmer School. (b) Total
scores on the scale correlate with chronological age .921 ± .004.
(c) Correlations with the Stanford-Binet scale, including values of
.793 and .783 ± .025, are reported.

These correlations with the Stanford-Binet scale have been
seriously questioned. Wellman (1938) tested 281 children from
20 to 62 months old with the Merrill-Palmer scale, deriving percentile
scores, standard scores, and intelligence quotients. On
retesting somewhat later she found frequent gains, particularly
among the superior children. In particular, she found quite a low
correlation with the Stanford-Binet scale given at about the same
time. DeForest (q.v.) also reports low correlations between the two
scales, and finds that they tend to decrease as the age at which
the Merrill-Palmer score is obtained increases.

One point connected with Wellman’s work calls for comment.
She found that intelligence quotients based on the scale are very
unstable. This, however, is to be expected, for as Stutsman (1931)
in the manual points out, it cannot properly be used for the derivation
of I.Q.’s. The reason for this is that the distribution of
mental ages at various chronological ages is not stable. Thus an
I.Q. which would be 2.5 S.D. above the mean throughout would
vary all the way from 122 to 165, an I.Q. consistently 2.0 S.D.
above the mean at all ages would range from 119 to 154, an I.Q.
1.5 S.D. above the mean at all ages would range from 114 to 141,
and an I.Q. 2.5 S.D. below the mean at all ages would range from
58 to 70. Thus it should be recognized and clearly understood
that the use of this scale to determine intelligence quotients is not
legitimate. Yet apparently even careful workers cannot resist the
lure of this glamor score.
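
Why a fixed standing in S.D. units yields a shifting ratio I.Q. can be seen from a small Python sketch. The norms are invented for illustration and are not the Merrill-Palmer tables. The only assumptions are the conventional ratio I.Q. of 100 times M.A. over C.A. and a spread of mental ages that is not proportional to chronological age.

```python
def ratio_iq_at_deviation(z, chronological_age, mean_ma, sd_ma):
    """Ratio I.Q. (100 * M.A. / C.A.) of a child z S.D. from the age-group mean."""
    mental_age = mean_ma + z * sd_ma
    return 100.0 * mental_age / chronological_age

# Hypothetical norms in months: C.A. -> (mean M.A., S.D. of M.A.).
# The S.D. is deliberately not proportional to age.
HYPOTHETICAL = {24: (24.0, 6.0), 36: (36.0, 5.0), 48: (48.0, 4.0), 60: (60.0, 9.0)}

for ca, (mean_ma, sd_ma) in HYPOTHETICAL.items():
    print(ca, round(ratio_iq_at_deviation(2.0, ca, mean_ma, sd_ma)))
```

A child who stays exactly 2 S.D. above the mean in these invented norms would show ratio I.Q.'s of 150, 128, 117, and 130 at successive ages, without any change in relative standing.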

By way of general summary and appraisal, the following points 
should be noted. (a) The instrument is an excellent supplement 
to the Stanford-Binet scale, but for independent appraisals Pt? p 
ably not so satisfactory. (b) It seems to tap a somewhat limit? 
range of abilities. Gross motor, locomotor, and postural behav! 
patterns are not recognized, though they are prominent in 
reactions of children. Drawing, again, is significant enoug 
deserve inclusion in more than three of the subtests. Lang¥@° 
could be more emphasized with probable advantage, the chie 
subtest recognizing it being the Action Agent, probably the ber 
subtest in the scale. (c) It makes more of the speed factor tha 
could be wished. 


3. The California Preschool Mental Scale *

This is another instrument for individual administration, intended
for ages from 1½ to 6. In content it is in general similar
to the two already mentioned. The subtests fall into 10 categories.
(1) Manual facility. (2) Block building. (3) Drawing. (4) Form
discrimination. (5) Spatial relationships discrimination. (6) Size
and number discrimination. (7) Language comprehension. (8)
Language facility. (9) Immediate recall. (10) Completions. There
are one or more subtests of each type at each age level. This is
done in an attempt to make the scale homogeneous throughout
the entire age range and constitutes one of its distinctive features.

It yields three types of scores, to wit, approximate mental ages
and resulting intelligence quotients, standard deviation scores, and
profiles based on the various types of tests. The scoring has been
found to be rather difficult, and the criticism has been made that
the manual does not sufficiently describe the responses which are
to be considered satisfactory in view of the very varied and fluid
reactions of young children. Judging from the size of the standardization
group the norms should have an adequate foundation, but
complete details are not given.

* Reference: Jaffa.


4. California First-Year Scale †

This scale, for individual administration, is intended for infants
from 1 to 12 months old. The items and separate subtests of this
scale are not dissimilar in general type from those usually found.
A distinctive feature of its construction is that it was based upon
a sequential study of the same group of 61 infants who were
tested at monthly intervals from 1 to 15 months, and then at 18
and 21 months. It was not always possible to secure all the members
of this group for each testing in the sequence, but there were
never less than 46. Compared to the conventional standardization
group this one, of course, is very small. But the principle of
following up the same children over a considerable period of time
is an excellent one. It should be noted that the children involved
were a rather highly selected group.

The scoring is either in absolute scale units, or in standard
deviation scores, or in mental age units. The last named practice
is not recommended. In general the scale is regarded as an excellent
practical instrument.

5. Iowa Tests for Young Children *


This instrument, outcome of twelve years’ work, is for children
4 months to 2 years old. It has 48 subtests, as shown in the
synoptic outline in Figure 20.

A feature of this scale, as may be gathered from the synoptic

† Reference: Bayley, 1933.
* Reference: Fillmore.

1. Sitting on lap unsupported
2. Accepting second cube offered
3. Taking cup off hidden object
4. Attempting to stand
5. Reacting to image in mirror
6. Locating sudden sound
7. Carrying ring to mouth
8. Examining object
9. Looking on floor for fallen object
10. Sitting on table or floor unsupported
11. Attempting to ring bell
12. Showing interest in picture
13. Poking at pellet
14. Ringing bell
15. Raising self by chair
16. Hunting covered object
17. Picking up pellet
18. Walking with help
19. Trying to put cork in bottle
20. Trying to put penny in bank
21. Marking with pencil
22. Accepting third cube
23. Placing cubes in box, score 3
24. Trying to put sand in jar
25. Putting cork in bottle
26. Putting 2 boxes of nest together
27. Putting penny in bank
28. Piling blocks
29. Placing pegs in board
30. Placing cubes in box
31. Putting 3 nest boxes together
32. Throwing ball to examiner
33. Putting sand in jar, awkwardly
34. Pointing to object in picture
35. Putting key in padlock
36. Cubes in box
37. Rolling ball to examiner
38. Pointing to features
39. Unscrewing jar lid
40. Skeels form board
41. Cubes in box
42. Skeels form board
43. Naming objects in picture
44. Putting sand in jar
45. Drawing circle, hand guided
46. Putting all nest boxes together
47. Matching boxes and covers
48. Skeels form board


FIG. 20. IOWA TESTS FOR YOUNG CHILDREN.
SYNOPTIC OUTLINE


list of items presented, is its avoidance of the heavy loading with
material of a personal and social kind which seems unduly influenced
by home surroundings, and of a purely motor kind. There
is for each item a steady increase in the percentage passing from
age to age, and each item is found to differentiate to a marked
degree between those who are bright and dull judged by the scale
as a whole. This is perhaps the best validation evidence which the
author is able to present, and of course it is far from conclusive.

Yet another feature of statistical and technical interest is the
method used for determining the age placement and order of the
items. The procedure, based on assigning them to age groups in
which they were passed by 50% to 60% of the subjects, was found
unsatisfactory, and so they were placed “at par” according to the
method developed by Thurstone (1925). The formula by which
this was done was as follows:


$$\text{age at par} = y + \frac{50 - P_1}{P_2 - P_1} \times (\text{one class interval of age})$$

y = age at which below 50% correct answers occur
P1 = percentage passing item at age y
P2 = percentage passing at age y + (one class interval of age)
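
The printed formula is not fully legible, but the definitions given for y, P1, and P2 are those of a straight-line interpolation to the age at which 50% would pass. The following Python sketch implements that reading as an assumption; the function name, the class-interval argument, and the sample percentages are hypothetical and are not taken from the Iowa manual or from Thurstone.

```python
def age_at_par(y, p1, p2, class_interval=1.0):
    """Age at which 50% would pass, by linear interpolation between two ages.

    y  : last age at which fewer than 50% pass the item
    p1 : percentage passing at age y
    p2 : percentage passing at age y + class_interval
    """
    if not (p1 < 50.0 <= p2):
        raise ValueError("expected p1 < 50 <= p2")
    return y + class_interval * (50.0 - p1) / (p2 - p1)

# Hypothetical item: 40% pass at 12 months, 65% at 14 months.
print(age_at_par(y=12, p1=40.0, p2=65.0, class_interval=2.0))  # 12.8
```

Interpolating in this way places each item at a single age rather than within a band of ages, which is presumably what made it preferable to assignment by a 50% to 60% passing range.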


6. Motor Achievement Test * 

This test has a somewhat different purpose and is of a somewhat
different type from the foregoing instances. It calls for performance
in 4 categories of tasks. (1) Ball activities. Bouncing
and throwing a ball in various ways. One type of reaction is for
the child to bounce and throw the ball with a location field marked
on the floor. He stands on the edge of it, and throws or bounces
the ball to the examiner. Evaluation depends on his success in
staying within the indicated zone, on distance, and on his use of
one or both hands. Another type of reaction is the catching of
the ball when thrown to him by the examiner at the level of his
chest. Two balls are used, one of 9½ inches and the other of
16½ inches circumference. Three trials are allowed for each performance.
(2) Hopping, skipping, walking. The consideration here
is ability to maintain equilibrium. The test calls for walking in
a path and a circle, the path being 10 feet long and 1 inch wide,
the circle being 4 feet in diameter; and scoring depends on the
number of times the child goes off the indicated track, three tries
being allowed. As to hopping, this calls for hopping with one or
both feet, and also walking and shuffling. In skipping, the child
imitates the examiner. (3) Jumping. The child is required to jump
out of boxes of four heights: 8 inches, 12 inches, 18 inches, and
24 inches. He is given three tries at each height, and the method
he uses is checked. (4) Stairs and ladders. This calls for ascending
and descending stairs with different numbers of steps. For
ladder reactions, two kinds of ladders are used, one with 12 rungs
6 inches apart, another with 6 rungs 12 inches apart, both being
set up at 45 degrees.

* Reference: McCaskill and Wellman.

The test is organized to yield motor ages. These are determined
by the age levels at which 50% of the tasks are satisfactorily
passed. There are separate scores for the four categories above.
These scores show fairly high intercorrelations, but they would
probably be much lower if age were held constant. A retest
correlation of .98 is reported on repetition of the test after one
week.

Some light on the general significance of this instrument and
the performances it elicits and endeavors to appraise may be
gained from Bayley's study of the relationship of motor and
mental development in young children (Bayley, 1934). She finds
that these two broad categories of behavior have different growth
patterns. Motor development in the first two years is more marked
than mental development, according to her reports, after which
it slows up. Little is known about the predictive value for later
behavior of motor control at early ages, or about its relationship
to and predictive value for mentality.


7. The Developmental Examination 


An approach to the problem of appraising the behavior, mentality,
and development of infants and preschool children very different
from those so far considered has been achieved by Gesell and his
co-workers (q.v.). They have devoted lengthy and comprehensive
attention for many years to all aspects of early mental and
behavioral growth. Out of their studies has come a technique
for developmental examination which records and evaluates the
phenomena of the individual child's growth from year to year
and almost from month to month. It is far more inclusive than
the ordinary test and also far more significant. The developmental
examination requires a trained clinical examiner and calls for much
skill and interpretive insight. However, its general aspects are
pertinent to the topic of this chapter. It cannot be described in
any fulness here, but its general purport can be made reasonably
clear to the reader. He should consult the various writings
of Gesell, and more particularly Gesell and Amatruda, which
presents the latest and most comprehensive account, and also Gesell
and Thompson, 1938.

The developmental examination is a codification of the diverse
and complex phenomena of human development into a series of
schedules. These schedules indicate the behavior patterns to be
expected for every 4 weeks of life from birth to 12 months, and
thereafter for every 3 months up to 42 months. Two of these
schedules are summarized in order that their general drift and
significance may be understood.

At twelve weeks the child, when supine, is found with head
predominantly half side, less fully rotated than when younger.
Chin and nose are in line with median line of trunk. Arms are
symmetrically disposed. When sitting, his head is set forward,
erect, but bobbing and unsteady, somewhat thrust forward. Prone,
he rests on forearms with arms flexed, weight resting on elbows
and forearms. When presented with an inverted cup which is part
of the testing material, he regards it more than momentarily and
contacts it. So also with the cube which, too, is part of the testing
equipment. When a ring dangling from a string, which again is
used in the testing, is presented to him he follows it visually for
as much as 180 degrees. His vocalization is marked by chuckles
just short of laughter. In his social-vocal response he vocalizes
in some manner, or "talks back" in response to social-vocal
stimulation. In his spontaneous play, he brings one or both hands
before his face for regard.

At thirty months the child walks on tiptoe, jumps with both
feet, tries to stand on one foot when encouraged to do so and
shown how by the examiner, holds a crayon in fingers instead of
in fist. He can build a tower of eight blocks which should be well
enough constructed to stand alone. (Blocks also are part of the
testing material.) He uses cubes to add a chimney to a train
made out of cubes, when the examiner asks […]. He draws two or
more strokes for a cross. He […] correctly one of the color
forms […] and adapts, although […] the board […] changing its
position relative to him. He can give his full name. He can
indicate the uses of the test objects, such as keys, and so forth.
In communication he refers to himself by pronoun rather than by
name. He shows repetitiveness in […] and other activities.

[…] and rating blanks have been prepared for recording […]
behavior interview.

Taken in conjunction with a proper […] clinical evidence, such an
examination offers highly significant and pertinent data. Gesell
reports that such a comprehensive developmental examination
competently interpreted is of high value, and indicates later
behavioral and mental status with a high degree of probability
(see Gesell, 1940).

The developmental schedules are available in separate printed
sheets, suitable for convenient use.


EVALUATION OF INFANT TESTS


In view of the great present interest in the possibility that
environmental influences may greatly affect early growth, an
informed and judicious opinion as to the values and limitations
of the scales and tests by which infant mentality is determined
is highly desirable.

1. […] there are two approaches, related, no doubt, but in
many essential respects very different, to the evaluation of infant
[…] an array of items more […] reference to a sample group.
The other, represented by the […]
(Bayley, 1933). In all, 49 of […] available throughout a
three-year period. The […] variety of tests, out of which the items
for the scale just mentioned were selected. As the table shows,
scores have a reasonably high predictive value over a short period
of time. For instance, scores made during the 7th, 8th, and 9th
months correlate as high as .81 with scores made during the next
three months, namely the 10th, 11th, and 12th. But these former
scores correlate only .22 with scores achieved during the period of
the 27th, 30th, and 36th months. The significant thing to notice in
this table is the steady drop in indicated relationship as the time
between the initial testing and later testings increases. In the
same way, as is shown in Table 18, an intelligence quotient obtained
at an early age shows more and more instability as the time interval
before retesting becomes longer. It will be noted that the magnitude
of the mean changes, both positive and negative, shows a marked
increase as the time interval lengthens. The reader, of course,
should understand that the intelligence quotients here reported were
obtained at very early ages, so that the study has only a slight
bearing upon the general problem of I.Q. constancy.


TABLE 17

CORRELATIONS OF SCORES OBTAINED DURING CERTAIN PERIODS WITH
SCORES OBTAINED DURING LATER PERIODS

(Bayley, 1933 a, Table 9, p. 47)

SCORES FOR      CORRELATIONS WITH AVERAGE SCORES FOR SUBSEQUENT
THREE-MONTH     THREE-MONTH PERIODS
PERIODS
                4 5 6   7 8 9   10 11 12   13 14 15   18 21 24   27 30 36

1 2 3            .57     .42      .28        .10       -.04       -.09
4 5 6                    .72      .52        .50        .23        .10
7 8 9                             .81        .67        .39        .22
10 11 12                                     .81        .60        .45
13 14 15                                                .70        .54


Honzik (1938) again made a study of 252 children, using the
California Preschool Scale for ages from 11 months to 5 years, and
[…] I.Q. obtained at about 21 months […] very unstable, and does
not afford a reliable index of scores on the Stanford-Binet scale
later on. However, she also reports that by the time the age of
3 years has been reached, mental test performance seems to have
become decidedly more stable, and that significant predictions can
be made. Some of the data she
obtained are summarized in Table 19. This tabulation shows the
relationship of testings at various ages to an initial testing at
1 year and 9 months, and to a final testing at 7 years. It should
be remembered that two different scales were used in this work:
the California Preschool Scale up to the age of 5 years, and after
that the Stanford-Binet. The point, of course, to notice is that
the relationship to initial status grows less and the relationship
to final status grows greater as age increases. It should be pointed
out also that the relationship of the 3-year testing to final status
is already .56, that none of the correlations are strikingly low,
and indeed that a correlation of .42 between the test performance
of a child at 1 year and 9 months and at 7 years might well be
considered quite reasonable.


TABLE 18

AVERAGE CHANGES IN I.Q.'S OF A GROUP OF YOUNG CHILDREN OVER
CERTAIN PERIODS

(Bayley, 1940 b, after Table 3, p. 19)

AGES (IN MONTHS)       INCREASES            DECREASES        NO CHANGE
BETWEEN WHICH                 Amount of            Amount of
I.Q.'S COMPARED     Number    change     Number    change      Number

[…]                   21       6.72        23       5.06         1
[…]                   22       9.86        22       5.59         0
[…]                   25      10.24        17       4.53         3
[…]                   29       9.09        17       8.36         0
[…]                   28      14.45        16       6.2          0
[…]                    2      12.64        12       8.70         0
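
The pattern described for Table 17, a steady drop in correlation as the interval between testings lengthens, can be checked mechanically. The sketch below is illustrative only: it uses two rows as read from Table 17 (decimal points restored), and the names `TABLE_17_ROWS`, `steadily_declines`, and `shared_variance` are assumptions introduced here. The squared correlation is included as the usual rough index of how much variance two testings share.

```python
# Correlations as read from Table 17 (Bayley, 1933 a): each list runs from
# the nearest later three-month period to the most distant one.
TABLE_17_ROWS = {
    "months 1-3": [0.57, 0.42, 0.28, 0.10, -0.04, -0.09],
    "months 7-9": [0.81, 0.67, 0.39, 0.22],
}


def steadily_declines(correlations: list[float]) -> bool:
    """True if every correlation is lower than the one before it."""
    return all(a > b for a, b in zip(correlations, correlations[1:]))


def shared_variance(r: float) -> float:
    """Proportion of variance shared by two measures correlating r."""
    return r * r


for label, row in TABLE_17_ROWS.items():
    nearest, farthest = row[0], row[-1]
    print(
        f"{label}: declines steadily = {steadily_declines(row)}, "
        f"shared variance {shared_variance(nearest):.2f} -> "
        f"{shared_variance(farthest):.2f}"
    )
```

On these figures a correlation of .81 over three months corresponds to roughly two-thirds shared variance, while .22 over two years or more corresponds to about one-twentieth.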

TABLE 19

CORRELATIONS OF TEST PERFORMANCE AT VARIOUS AGES WITH INITIAL
STATUS AT 1-9 YEARS AND FINAL STATUS AT 7-0 YEARS

(Honzik, 1938, from Table 5, p. 295)

AGE IN YEARS      INITIAL STATUS    FINAL STATUS
AND MONTHS        1-9 years         7-0 years

1-9                                   .42
[…]                 .68               .46
[…]                 .59               .38
3-0                 .47               .56
[…]                 .50               .63
[…]                 .46               .66
[…]                 .32               .73
[…]                 .30               .81
7-0                 .42


Nevertheless it is quite true, as Bayley in several places points
out (1933, 1939, 1940), that no uniform over-all global score which
can be derived from our present tests has a high predictive value
over the period of early development. She found a so-called
"developmental score," which was a composite of performance on
mental and motor subtests, no better as a prognostic index than
a mental score alone. Also, she reports that vocabulary tests at
6 to 9 years are only moderately related to language tests at 3½
years, and not at all to the age of first talking and to very early
mental test scores. Form board and puzzle board performance at
5½ is related to scores on general mental tests and vocabulary
tests at the same ages, but not to tests of general ability at the
age of one year. She argues, and very soundly, that intelligence
quotients reported at very early ages are deluding. This may very
well be the case, for the uniform relationship between mental and
chronological age upon which the whole meaning of the intelligence
quotient entirely depends has not been established in the
standardization of tests for these early ages. The persistent use
of the I.Q. as an index of mentality in early childhood is yet
another instance of falsifications due to the popularity of the
measure. Bayley's explanation of these summarized findings is
that the whole mental make-up changes with development during
the early years of life. Quite probably it does. But one must also
remember that little children are difficult test subjects, and that
in work with them all sorts of variable errors are likely to crop up.

So, too, somewhat low correlations are reported as between
the various infant tests themselves, and between these tests and
the Stanford-Binet scale. To give a specific case, a series of
correlations between Merrill-Palmer scores in sequential 6-month
intervals and a final Stanford-Binet record at 6 years ran from
.34 to .66. (For this topic see De Forest, Fillmore, Wellman 1940 a.)
With regard to the norms set up by Gesell, ratings on the
Developmental Examination made at the age of 6 months do not
well predict standing on the Merrill-Palmer scale at 2 years,
though, of course, it is an open question which rating is the more
valid. For 123 children it was demonstrated that Merrill-Palmer
scores at 2 years and Stanford-Binet performance at 3 years,
respectively, correlated .37 and .46 with Developmental Examination
ratings at 6 months. The suggestion once more is that these
rather low correlations may be due to a shift in mental organization
itself, for a factor analysis of the Developmental Examination
Schedule for 6 months has indicated that it measures the two
factors of alertness and motor ability (Nelson and Richards;
Richards and Nelson).

3. The relative instability of test results obtained during early
childhood, and the relatively low agreement of such tests both
with one another and with tests for later years involve some of
the most fundamental of psychometric issues.

A. It may be due to characteristics in the tests themselves. As
will be seen from the synopses of several important tests for
young children which have been presented above, they consist
preponderantly of performance-type items. Performance tests,
even when given later in life, do not correlate as highly with
verbal tests as verbal tests do with one another.

B. It may be due to the characteristics of the subjects. Little
children tend to exhibit negativism and to refuse to do what they
are asked to do, in particular by a stranger. They are shy. Their
responses are very varied, unpredictable, and hard to classify and
evaluate. There is a constant problem of rapport. All this means
increased chances for variable error. Older children, on the other
hand, are not only more stable and manageable constitutionally
but they tend to acquire in school just the attitudes desired in
good test subjects.

C. It may be due to the rapidity and fluidity of early growth.
That all development proceeds very rapidly in early life seems
well established. And the changes may be qualitative as well as
quantitative. Both Stoddard (1943) and Bayley (1933 a) have
suggested that the organization of the mind may alter in itself in
the first two, three, or four years of a person's life.

D. It may be due to the cumulative effects of the environment.
That environmental influences are peculiarly potent during early
childhood is by no means an unreasonable proposition. If it is
true, then test performance would clearly change, and the change
would become greater with the lapse of time. As several
investigators have remarked, among them J. E. Anderson (1939),
chronological age is by no means the uniform factor it is ordinarily
assumed to be. Two children with a chronological age of three
may have spent those three years very differently, and the
difference may affect their whole mentality. This, above all, is the
point about which controversy is now centered.

4. Atkins (q.v.) has presented an excellent summary of the
criteria by which a test intended for young children should be
judged. (a) Its material should be intrinsically interesting, as one
way of avoiding negativism and indifference. (b) It should require
a minimum of oral directions, again to avoid as far as possible
the effects of shyness and poor rapport. (c) It should demand
only a brief span of attention for each item, since little children
are highly distractible. (d) The materials should be as simple as
possible. (e) So far as possible, the test content should be based
on and selected in terms of equality of previous experience,
although this cannot be fully attained. (f) So far as possible the
test items should be noncommunicable, so that the child's mother
or older guardian can be in the room while they are run, without
risk of suggestions and expressions of approval, disapproval, and
so forth. This again is for the sake of rapport, and to avoid
negativism and shyness. (g) Credit should be given for each actual
response, not for two out of three, or only for all ten, etc. (h)
Conditions for administration, and also the scoring instructions,
should be as objective as possible. Many such tests place altogether
too much reliance on the judgment of the examiner in both respects.
(i) The standardization should be adequate. (j) The test should be
set up for complete presentation of relevant data to make research
possible.

Virtually all tests for young children are open to criticism on
one or more of these criteria.

5. To summarize, there is no doubt that for one reason or
another, and probably for several reasons, retest correlations,
intercorrelations at identical ages, and correlations extending over
considerable periods of time are lower with young children than
with older children. Yet it is by no means justifiable to claim
that test results with young children have no prognostic value, as
some would seem to suggest, and that general conclusions based
upon them are to be disregarded. An examination of the data
presented in Tables 17, 18, and 19 bears this out. In Table 17 it is
shown that the results of very early testing have little relationship
to test scores earned some years later. The same appears from
Table 19. But when the three-year-old level is reached, the
correlations begin to assume indubitable significance. Thus in Table
19 it appears that test scores at age 3 correlate .56 with those at
age 7, and many comparable or higher coefficients are to be found
in these data. If this is not a finding peculiar to these particular
investigations, but indicates approximately the true relationship,
it then compares quite favorably to the relationship between
mental test scores obtained in high school, including senior year,
and achievement in college, which is surely about as much as one
could reasonably expect. Indeed, as one surveys these figures as
a whole, it is difficult to find in them anything disastrous to the
status and general prognostic value of tests for young children.
Of course, such tests must be used and their results must be
interpreted with special care, although such a reservation can
very well be made with respect to all psychometric results. No
doubt, too, as J. E. Anderson (1939) and others point out, the
younger the subject, the greater the risk of error. But on the face
of the evidence, persons who do not like certain conclusions,
specifically regarding the influence of the environment, which have
been drawn from the testing of young children, must rebut them
in some more convincing way than by a wholesale attack upon
the scales themselves. Unless, indeed, they are prepared to reject
every kind of psychometric evidence.


TESTING ADULT INTELLIGENCE 


Turning now to the other extreme of the age range, namely
adult mentality, still further practical and theoretical psychometric
problems and issues of the first importance are involved.
The situation can be summed up briefly and simply. Very few
existing instruments of mental measurement are well suited to
deal with adults. When so used, the tests that are available are
apt to lead to very misleading conclusions. It is both important
and revealing to analyze the reasons for this striking limitation
and to ask how and whether it can be overcome.


1. Reasons for the unsuitability of existing tests


Most, although not all, existing tests are not well suited for the 
measurement of adult intelligence for the following reasons. 

A. Tests have been prevailingly standardized on persons in
school. This is partly, though not entirely by any means, a matter
of sheer convenience. It is much easier to secure standardization
groups of adequate size from school populations than from among
independent adults. One thing that made the standardization of
the Army tests feasible was that the subjects were under orders,
and so could be made available as needed. But to do this with
adults from the general population, and to secure adequate numbers
willing to submit to the sometimes rather lengthy processes
necessary, is quite another matter. So school groups have been
very largely drawn upon. This involves several limitations. (a) It
means that standardization is run on persons within a limited
range of age. (b) It means that whereas the norms may represent
an unbiased sample of the school population, they almost certainly
involve a certain bias with respect to the general population. The
mentality of pupils in school is probably different in various
undefined but not unimportant respects from that of independent
unselected adults. (c) Perhaps most important of all, such
standardization groups very easily involve a factor of selectivity,
particularly if they are chosen from the upper grade levels. Thus,
when the norms so obtained are applied to the general population,
the results are often disconcerting and even fantastic. The
outstanding instance of this is the finding that the average mental
age of the population of the United States is approximately 13
years, which came from the first Army testing program.

B. But the use of school groups for purposes of test construction
involves something far more than mere external convenience.
[…] working conception of intelligence […] and operates as a kind
of […] which influences test construction in all its aspects. That
conception is by no means invalid or erroneous, but it is limited
and special. And its effect is constantly present. It provides
ready-made, easily available criteria for validation, which are
easily expressed in numerical terms suitable for statistical
treatment. These are the various measures of school progress and
school achievement which can, if needed, be supplemented by teacher
ratings. They are the criteria very widely used for item selection
on the basis of power to discriminate the bright from the dull, and
for the over-all validation of finished tests. There is a general
but none too explicit agreement as to what validity actually means
in practice, for the reason that the test is constructed in and for
a milieu where an institutional conception of intelligence is
operating. A valid test is one that agrees with this conception,
which, to repeat, probably has authenticity and general significance
within limits but is certainly specialized to an undefined degree.
We know what intelligence means in terms of school life and
experience. It means school success. This is by no means unrelated
to success in life in general, and particularly to the intellectual
tasks of life. But neither is there any complete identity.

Other and less comprehensive and manageable institutional
criteria of mental quality have been used from time to time. One
instance is institutional commitment and residence in homes for
the feeble-minded and the mentally deranged, which have been
used from time to time in psychometric work; and subject to
reasonable clinical precautions, our best tests of intelligence and
also of personality check up against them tolerably well. Attempts
have also been made to demonstrate a hierarchy of intelligence
with respect to various occupations. The most ambitious instance
was in connection with the Army testing program during World
War I, and instances of the broad results are presented in Table
20. Work of the same kind, though less in extent, has been done in
connection with the testing program in World War II (Harrell
and Harrell). The outcomes are broadly similar. Occupations
which show a mean score of 120 or more on the Army General
Classification Test are accountant, lawyer, engineer, public
relations man, reporter, chief clerk, teacher, draftsman,
stenographer, pharmacist, tabulating machine operator, and
bookkeeper, in descending order of superiority. The lowest-rating
occupations, in order from low to high, are teamster, miner,
farmhand, farmer, lumberjack, barber, laborer, truck driver, […]

There does seem to be some sort of ranking of occupational
intelligence, although it is not a very stable one and shows a
great deal of overlapping. But to use it as an institutional working
criterion for test construction would be impossible. Probably
it is true that for many occupations a certain minimal intelligence
is necessary. But the reason why persons of superior mental
endowment are not usually found in unfavored vocations is not that
they could not succeed in them, but that they do not like them. So
[…] implied in such a ranking is not at all simple or unequivocal,
and to select test items because, for example, they discriminated
between bookkeepers and station agents would be fantastic.


TABLE 20

INTELLIGENCE RATINGS IN VARIOUS OCCUPATIONS
(Selected and adapted from Fryer, 1922)

Intelligence Group A          Intelligence Group C
  Engineer                      Locomotive Engineer
  Clergyman                     Policeman
  Accountant                    Toolmaker
                                Actor
Intelligence Group B            Tieman
  Physician                     Painter
  Teacher
  Accountant                  Intelligence Group C-
  Dentist                       Hospital Attendant
                                Shoemaker
Intelligence Group C+           Sailor
  Bookkeeper                    Textile Worker
  Photographer
  Railroad Conductor          Intelligence Group D
  Electrician                   Fisherman
  Druggist


The truth is that in order to be of service in validation, our
general conception must express itself not merely in a verbal
formula but in operating and tangible form. In practice this
means that it must express itself in some kind of institutional
milieu. Such a milieu is provided in an almost providential manner
by the school. But for the general adult population no such simple
practicable criterion is at hand.

C. One of the assumptions underlying much test construction
is the existence of a determinate relationship between
chronological age and mentality. Such procedures as bringing the
mean test performance and the mean chronological age of age groups
into equivalence, locating subtests at the point where they are
passed by 50% of the age group, locating them "at par," selecting
test items because they show increasing scores with advancing
age levels, and so forth, are simply ways and means of putting this
hypothesis to practical use. If it were not true, they would all
become meaningless statistical manipulations. There is every reason
to believe that the assumption, with certain qualifications no
doubt, is substantially true with young people. But when it comes
to mature human beings, a very serious doubt arises. This is why
on both the Stanford Revisions of the Binet scale all mental ages
for the upper levels are derivative. Certain questions and
objections can be and in fact are raised concerning mental age
determinations even with the young. But at least this kind of
statistical interpretation of test performance has a ponderable
reason back of it here. But when adults are concerned, the support
for this practice becomes much more shaky and uncertain. This, of
course, is why Wechsler, in constructing a test designed to deal
with adult intelligence, abandoned the whole concept of mental age.
One cannot but feel some regret that he still retained the
intelligence quotient, at least in name. To argue in favor of these
measures, as Terman and McNemar have done, that laymen find them
easy to understand, is not at all convincing. Perhaps what laymen
really find easy is to misunderstand them.
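
The difficulty with age-based measures for adults can be illustrated with the conventional ratio definition of the intelligence quotient, I.Q. = 100 × mental age / chronological age. The sketch below is illustrative only: the function name `ratio_iq`, the assumed plateau of mental age at 15 years, and the example ages are assumptions made for the demonstration, not figures from any test manual. It shows that once mental age stops rising while chronological age continues to advance, the uncorrected ratio falls steadily although the person's performance has not changed.

```python
def ratio_iq(mental_age_months: float, chronological_age_months: float) -> float:
    """Conventional ratio I.Q.: 100 * mental age / chronological age."""
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return 100.0 * mental_age_months / chronological_age_months


# Assumption for illustration: measured mental age stops rising at 15 years.
PLATEAU_MONTHS = 15 * 12

for age_years in (10, 15, 20, 30):
    chronological = age_years * 12
    # An average person keeps pace with age until the assumed plateau.
    mental = min(chronological, PLATEAU_MONTHS)
    print(age_years, ratio_iq(mental, chronological))
```

Under these assumptions the same average individual would score 100 at ages 10 and 15 but 75 at age 20 and 50 at age 30, which is why scales in this tradition must treat upper-level mental ages as derived values rather than as literal ages.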

D. Mature persons are likely to become considerably, and
indeed sometimes extremely, specialized in their interests and
their patterns of mental activity. An automobile mechanic, for
instance, may be highly ingenious, adaptive, and resourceful in his
own field of endeavor, but quite the reverse if he has to deal with
problems of salesmanship or administration or finance. Presumably
one has to believe that intelligence possesses certain universal
characteristics. But to deal with them for the sake of measurement,
or for any other purpose, they must be approached through
their special manifestations. This is much more difficult with the
highly specialized adult than with the relatively unspecialized
child. It follows, therefore, that the sort of over-all average
rating which is the true psychometric meaning of general
intelligence in the Binet tradition (and Binet also worked with
children) is much more likely to approximate the true mentality of
a child than of an adult.

Along the same line, too, it should be pointed out that the
backgrounds of a group of children are almost certain to exhibit
more uniformity than those of a group of adults, for the very simple
reason that cumulative differences have had less time to build up.

So, while it is presumably true that intelligence has the same
universal characteristics in children and adults, its manifestations
in the latter become more complex, diverse, and specialized. This
is one of the chief reasons why it is harder to deal with adult
intelligence by psychometric techniques.

E. Finally, conventional test material is not well suited to
adults. This is almost inevitable, because the great pool of test
items which has been accumulating through the years has come
in the main from tests designed for, tried out with, and
administered to children and young people. The kind of test items
commonly used often strikes adults as silly, trivial, merely
manipulative, concerned only with word juggling, requiring nothing
but information, etc. Such objections apply both to verbal and
performance material. Above all, there is a strong tendency in
many tests to emphasize speed, whereas the desire of many adults
is to ponder before deciding. When one recalls the strong and most
emphatically legitimate insistence of Stoddard and Boynton that
intelligent behavior cannot be dissociated from attitude,
motivation, and a sense of the significance of the task, the
seriously disturbing effect of such material becomes abundantly
evident.


2. Evaluation for adult use of tests already discussed

With these problems in mind, it is desirable to evaluate well-known
general tests from the standpoint of their suitability for
adult use. This can be done quite briefly, since representative
instances have already received a more extended general
discussion.

A. Army Group Intelligence Examination Alpha. This was
constructed and used as a test for adults. It was aimed at a group
with a very special background, but specialized items can readily
be edited out, as the various revisions have shown. It has, however,
many limitations for general adult use. The speed factor is
unduly prominent. It is a purely verbal test. The item content
often strikes adults as trivial. But above all, it is so easy that it
cannot reveal the mentality of highly endowed adults. For adult
groups of low to mediocre intelligence, one or other of the
revisions of the original Alpha may be serviceable instruments,
though even so the factor of adult specialization readily vitiates
the ratings. One may say that even with adult groups who will
not reach its ceiling it should be used in conjunction with other
tests designed to reveal special aptitudes.

B. Otis Quick-Scoring Tests of Mental Ability. Much the same
considerations apply here, except that the test does not contain
such specialized items as Army Alpha. But it does emphasize
speed of response. It is too easy for a considerable percentage of
the adult population. And it has the limitations of any general test
when so used.

C. Thorndike Intelligence Examination for High School Graduates.
This test is of a difficulty fully adequate for adult use. Its
limitation is that it is designed for a special group, namely persons
intending to enter college. For them it is well adapted. But it is
strongly academic and verbal in its emphasis and could not be
used upon an unselected adult population except for some specific
purpose, and with qualifications and supplementation.

D. American Council on Education Psychological Examination
for College Freshmen. Once again this is a test of quite adequate
difficulty. But once again it is directed to the special purposes of
a special group. In terms of that group it is a good test of intelligence,
and its item content is not very likely to be decried as
inconsequential. But it explicitly centers upon verbal ability and
numerical ability, and this clearly limits it for general adult use.

E. Revised Stanford-Binet Scale for the Measurement of Intelligence.
The scale extends well into the adult area, and can
undoubtedly be used to good effect for some of the purposes of
adult testing. Any age scale, however, becomes doubtful when
applied to persons above the level at which performance ceases to
improve regularly with age, i.e., the age of arrest in terms of the
instrument. Experience shows that adults often find the item content
somewhat objectionable. Moreover, it is the outstanding
example of a comprehensive sampling test, and thus is likely to be
subject to a peculiar extent to falsification by adult specialization.

F. I.E.R. Intelligence Scale CAVD. This test is of ample difficulty
for adult use, only a very small percentage of the population
scoring in the upper decile. But it is intentionally narrow and
centers entirely upon expertness with words and symbols.

G. Wechsler-Bellevue Intelligence Scale. This is generally accepted
as in many respects the most satisfactory all-round adult
intelligence test. It was standardized on adult groups as well as
adolescents. Performance on the scale is not interpreted in mental
ages. It extends upwards to advanced age levels. It yields not only
a global measure, but also scores based separately on performance
subtests and verbal subtests, both of which it contains. And as
Rapaport and others have shown, it has considerable diagnostic
efficiency which could probably be increased if further attention
were given to the profiles it yields. Of course, however, it has
limitations. It is an individual test, which implies various practical
drawbacks. It has no decisive external validation, although increasing
experience with its use supports it. And it is by no means
so suitable as some other instruments for special purposes and
special groups, particularly for the prediction of academic success.


TESTS DESIGNED SPECIFICALLY FOR ADULTS


1. Wonderlic Personnel Test * 

This is an adaptation specifically for adult use of the Otis Self-
Administering Test of Mental Ability, Higher Form. The latter
is regarded as too easy for general adult use, and accordingly the
difficulty level is raised and a more even order of difficulty provided.
The test is greatly abbreviated, and can be run in 12 minutes
as compared to 30 minutes for the Otis. In content, it is a
scrambled omnibus arrangement of 50 items. The title is chosen
to avoid what the author believes to be the alarm engendered in
taking an intelligence test. The self-administering feature is retained.

Because of the specialized nature of the test, norms are not
computed for an unselected population. Instead, they are developed
for representative industrial and business groups, to wit,
outside representatives, clerks, managers of local offices of the
Household Finance Corporation, vacuum cleaner salesmen, typists,
Hollerith key operators. Also there are norms for educational and
sex groups. Scores tend to fall with age, and a formula for compensation
is provided. It is found to have about the same reliability
as the Otis, with which it correlates from .81 to .87. Validation
data based on occupational success indicate that, of managers
who score 25 or less, 78% fail. The mean for successful employees
is 29.7, and for unsuccessful 25.

* Reference: Wonderlic and Howland.


2. Army General Classification Test *

This is one of the basic intelligence tests developed in World
War II. A synoptic outline is presented in Figure 21.


Part 1, Sentence Completion
Incomplete sentences followed by five completing terms, one to be
chosen.

Part 2, Opposites
Series of terms followed by five other terms, the one most nearly
opposite to the first to be indicated.

Part 3, Analogies
Series of incomplete analogies followed by five terms, the one
completing the analogy to be indicated.


FIG. 21. ARMY GENERAL CLASSIFICATION TEST.
SYNOPTIC OUTLINE

The test has been developed in numerous forms in an extensive 
experimental process. Rather recently (1947) forms 1a and 1b 
have been released to the public domain. These and other forms 
have been given to over twelve million individuals. For evaluative 
purposes, scores have been divided into five categories, correspond- 
ing to five army grades, as shown in Table 21. 


TABLE 21
RATINGS ON ARMY GENERAL CLASSIFICATION TEST

(Staff, Personnel Research Section, Adjutant General’s Office, 1947, p. 393)

                    ARMY STANDARD SCORE RANGE
Army Grade     Through June 1942     From July 1942
    I          130 and higher        130 and higher
    II         110-129               110-129
    III        90-109                90-109
    IV         70-89                 60-89
    V          69 and lower          59 and lower


* Reference: Staff, Personnel Research Section, Adjutant General’s Office, 1947.

The test is unsuited to those with less than a fourth grade schooling.
Standard scores of less than 50 should be disregarded as probably
due to illiteracy. Low correlations with rank in service have
been reported (Duncan), and correlations around .55 with amount
of education have been found. The test has been used for preliminary
classification for various military occupations, and test
ratings appropriate for many of them have been published
(Harrell).
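
The grade boundaries of Table 21 can be read as a simple lookup. The following is only a sketch of that reading, not an official Army procedure: the function name and the period labels are illustrative, and Grade IV through June 1942 is assumed to run from 70 to 89, so that it is contiguous with the grades on either side.

```python
# Lower bound of each Army grade, transcribed from Table 21.
# Assumption: Grade IV through June 1942 is read as 70-89.
LOWER_BOUNDS = {
    "through_june_1942": [(130, "I"), (110, "II"), (90, "III"), (70, "IV")],
    "from_july_1942": [(130, "I"), (110, "II"), (90, "III"), (60, "IV")],
}


def army_grade(standard_score, period="from_july_1942"):
    """Return the Army grade (I to V) for an AGCT standard score."""
    for lower_bound, grade in LOWER_BOUNDS[period]:
        if standard_score >= lower_bound:
            return grade
    return "V"  # anything below the Grade IV floor


# The same score falls in different grades under the two schemes.
print(army_grade(65, "through_june_1942"))
print(army_grade(65, "from_july_1942"))
```

The only difference between the two periods is the floor of Grade IV, which is why a single table of lower bounds per period is enough.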


3. Army Individual Test of General Mental Ability 


The new Army Individual Test of General Mental Ability is
also of interest, which is heightened by a comparison with Army
Alpha of nearly thirty years ago. The new test is an individual
instrument requiring 40 minutes’ time. It consists of three verbal
subtests (story memory, similarities-differences, vocabulary) and
three nonverbal subtests (trail-making, cube assembly, shoulder
patches). Quite possibly it may come into wide use after suitable
revision, for it has considerable specific military content and bias
(v. Staff, Personnel Research Section, Classification and Replacement
Branch, Adjutant General’s Office, 1944).


4. United States Armed Forces Institute Tests of General 
Educational Development 

These might be considered educational tests, but their scope is
so broad and their emphasis upon mental processes so definite
that they can well be classified as tests of general intelligence.
Their purpose is general educational and vocational guidance, the
educational placement of returning service men, and also the determination
of the educational status of those not intending to
continue their schooling. They deal with the interpretation of
reading materials in the social studies and in the natural sciences,
the interpretation of literary materials, and also correctness and
effectiveness of expression as shown in the making of corrections
and improvements in printed passages originally well written but
deliberately corrupted. There is also a test of general mathematical
ability for high school level only which calls for the solving
of various practical problems, and more specialized mathematical
tests for college level.

As may be gathered, the tests are not directed towards specific
content, but towards the power to interpret and evaluate written
material. This naturally calls for a background of substantial
knowledge, but the emphasis is upon generalized intellectual abilities.
There are two batteries, one for high school level, the other
for college level. There are two equivalent forms for each battery,
one military, the other civilian. Norms for each test are developed
on student populations enrolled in appropriate courses. For instance,
the test in correctness of expression was standardized on
students in freshman English; the test in social studies on students
in survey courses in the field, among others. Norms are presented
for three types of institutions classified on mean freshman
scores for 1941 on the American Council on Education Psychological
Examination. The three types are institutions with mean
scores of over 113, those with scores from 113 to 95, and those below
95. The tests are work-limit, not time-limit tests, as they are
intended to measure power. Actually 120 minutes is ample time
for most individuals on the college tests, and 95 minutes on the
high school test.

One notable technical development in psychometrics during
World War II has been the wide use of very brief tests for “screening”
purposes, i.e., for quick classification and disposal (v. Hunt
and Stevenson; Hunt, Wittson, and Harris). Such tests include
abbreviations of the Wechsler-Bellevue scale, such as those by
Rabin (q.v.), Geil (q.v.) and Gurvitz (q.v.), and also many others.
The same tendency in connection with civilian testing has been
noted, particularly with reference to tests by Otis, and also with
reference to the Wonderlic Personnel Test. But the trend received
a great impetus in the armed services. The fact that such instruments
were used for screening purposes meets the common objection
that to be reliable and valid a test must be lengthy.


5. Desirable characteristics of adult tests

Stoddard (1943), on the basis of his work with the Iowa Placement
Examinations, which will be discussed later, and of broad
practical experience, has set up the following requirements for
any adequate program or battery for the measurement of adult
intelligence:

“A. General tests of adult intelligence
    (1) Tests of general comprehension
    (2) A logic test
    (3) A test of plasticity (learning and retention of new
        material)
    (4) A test of concepts of personality and behavior
    (5) A test of concepts of social responsibility
 B. Special tests of adult intelligence (illustrative items)
    (1) Advanced reading and comprehension
    (2) Concepts in mathematics and the physical sciences
    (3) Concepts in the social sciences
    (4) Concepts in the fine arts
    (5) Concepts in the humanities
    (6) Concepts in applied arts, crafts, and vocations”

(p. 157)


Solutions and constructions, not to be scored by a key but to be
rated by a committee of judges. The second would be a test
revealing power “to resist the strongest forces of suggestion and
rationality available within the practical limitations of testing”
(pp. 155-6). Such a battery might yield a single global score, but
would certainly yield profiles based on indices of the various
attributes or manifestations of intelligence. This conception of an
adequate scheme for the measurement of adult intelligence is the
clear consequence of Stoddard’s description of intelligence, which
has already been discussed in this book. As the reader will remember,
he thinks of intelligence as the ability to undertake activities
characterized by difficulty, complexity, abstractness, economy,
adaptiveness to a goal, social value, the emergence of originals,
persistence, and resistance to misleading distractions.

‘his recommend 
atte. “ion has alrea 
"ty ob. ‘ndesirabili 2 
Ment in a global or over-all score, i.e., a mental age, or an intelligence
quotient, or a percent
Practice is being very stringentl
Metric discussions. As has been seen, there is already a tendency
towards the construction of tests that do not yield such global
scores; or if they do, also offer profile scoring as a substitute or
alternative. Instances are the latest editions of the American Council
on Education Psychological Examination for College Freshmen,
the California Test of Mental Maturity, and above all the
Chicago Tests of Primary Mental Abilities.

Global scores which consolidate in a single measurement performance
on a variety of items have proved reasonably satisfactory
in tests for school purposes. But, as has been argued here,
this is because an effective institutional definition and criterion
of general intelligence is present. That this is indeed the reason
is suggested by the fact that such tests have proved far less satisfactory
and convincing outside the educational milieu. Thelma
Hunt (q.v.) summarizes a number of studies on the use of intelligence
tests for vocational purposes. Of 195 business concerns replying
to a questionnaire, only 17% used such instruments in their
employment procedures. Of 36 states replying, 11 had some centralized
personnel agency, and of these 7 used intelligence tests. A
sampling of reports on the relationship between intelligence test
performance and vocational success gives correlations of .22 for
stenographers, .34 for reformatory officers, .50 for patrolmen, .31
for firemen, .28 for bank examiners. If these are truly representative
figures, it is clear that existing intelligence tests do not measure
effectively what it takes to do well in a great many jobs. But
they are designed to measure primarily intelligence as it manifests
itself and leads to success in the job of going to school. For this
their global ratings have proved reasonably satisfactory.
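
One common way of reading such coefficients is to square them: the square of a correlation is the proportion of the variance in the criterion that is statistically associated with the test. The sketch below applies this to the five figures just quoted. The squaring is an interpretive aid added here, not a computation reported by Hunt, and the names used are illustrative.

```python
# Correlations between test performance and vocational success,
# as quoted in the text above.
CORRELATIONS = {
    "stenographers": 0.22,
    "reformatory officers": 0.34,
    "patrolmen": 0.50,
    "firemen": 0.31,
    "bank examiners": 0.28,
}


def variance_explained(r):
    """Proportion of criterion variance associated with the test (r squared)."""
    return r * r


for occupation, r in CORRELATIONS.items():
    share = variance_explained(r)
    print(f"{occupation:22s} r = {r:.2f}   r squared = {share:.2f}")
```

On this reading even the highest figure, .50 for patrolmen, leaves three quarters of the variance in vocational success unaccounted for by the test.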

The argument, then, is for profile or specialized scores and ratings.
Several times already we have noticed this trend in connection
with tests that have been discussed. And it is, indeed, the
outstanding feature of the new types of tests now to be considered.


EMERGING TYPES OF TESTS


Thorndike (1928), in an appraisal of the testing movement up
to that time, foresaw the development of psychometric instruments
which would measure well-defined mental functions, and would
center down exclusively on the functions as conceived. This is in
contrast to tests of loosely defined general ability or general intelligence
such as developed out of the work of Binet and out of that
done in the United States Army program. There are three recent
instances of just such tests now to be considered.


1. a, b. Chicago Tests of Primary Mental Abilities * 


This exceedingly important battery appeared in two considerably
different editions, so different, indeed, that they may almost
be called different tests. The experimental edition was published
in 1938, and the definitive edition in 1942. They represent a departure
in mental measurement, either the success or the failure of
which will be momentous.

* References: Thurstone, 1938; Thurstone and Thurstone, 1941.

The work was based on Thurstone’s monumental research in
factor analysis published in 1938. He gave 54 tests of various
kinds to 240 college students, obtaining in all about 1,500 correlation
coefficients. These were subjected to analysis to determine
the underlying factor pattern which would explain the interrelations
of test performance. He identified in this way a number of
“primary mental abilities,” i.e., basic mental functions cutting
across and involved in many different mental operations and types
of test performance. Those which received specific designations
were as follows. (a) P, i.e., perceptual ability. (b) N, i.e., numerical
ability. (c) V, i.e., verbal ability. (d) S, i.e., spatial visualizing
ability. (e) M, i.e., memory. (f) I, i.e., inductive or generalizing
ability. (g) D, i.e., deductive or reasoning ability.
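
The figure of about 1,500 coefficients follows from the number of tests alone, since every test is correlated with every other test once. A minimal sketch of the count (the function name is illustrative):

```python
from math import comb


def pairwise_correlations(n_tests):
    """Number of distinct test pairs, i.e. n(n - 1) / 2."""
    return comb(n_tests, 2)


# 54 tests give 54 * 53 / 2 distinct pairs.
print(pairwise_correlations(54))  # 1431
```

The exact count of 1,431 agrees with the rounded figure given in the text.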

The test battery built to measure these primary abilities consists
of 16 subtests in three booklets. Instead of a total or global
score, it yields a profile showing the distribution in the subject of
the seven basic traits or abilities. Since it was experimental, and a
definitive edition has now appeared, it will not be described in
detail here.

The second or definitive edition is shown in outline in Figure
22. As will be seen, it consists of 11 subtests instead of 16 as in
the experimental edition. The list of primary abilities embodied
and revealed in the tests has also been modified. Perceptual ability
and inductive ability do not appear. The designation W is new,
and stands for word fluency, or the ability to think of and use
words rapidly and copiously. The designation D is altered here
to R, which stands for reasoning ability. This test, like the previous
one, yields a profile in terms of the designated basic mental
abilities as defined.

Since the battery has not been out long enough for definitive
investigations to have appeared, such evaluation as can be made
will concern the experimental tests. But it probably applies to
the former as well.

First of all, this clearly is a development in test construction
which is of major importance. Basic concepts are defined with
precision. The idea of a loose average sampling of general intelligence
is given up, and with it goes the familiar global score or

 1. ADDITION (N)*
    Sets of columns of numbers with total given, latter to be
    marked right or wrong.

 2. MULTIPLICATION (N)*
    Sets of multiplications with products given, same task as above.

 3. VOCABULARY (V)*
    Set of stimulus words with four words following, task to choose
    the word with the same meaning as the stimulus word.

 4. COMPLETION (V)*
    Problem statements calling each for one word in response. Five
    letters given, task being to indicate the initial of the correct
    response word.

 5. FIGURES (S)*
    Stimulus figures each followed by six figures one of which is
    the stimulus figure in a new position, task being to select the
    stimulus figure.

 6. CARDS (S)*
    Stimulus pictures of cards in various geometrical shapes, each
    followed by six others, one being stimulus figure in novel position,
    task being to identify it.

 7. FIRST LETTERS (W)*
    Writing down as many words as possible beginning with a given
    letter.

 8. FOUR-LETTER WORDS (W)*
    Writing down as many four-letter words as possible beginning
    with a given letter.

 9. LETTER SERIES (R)*
    Deciding what would be the proper next letter in various letter
    series.

10. LETTER GROUPING (R)*
    Letter groups made on various principles, three groups in each
    series the same and one different, task being to identify the one
    different.

11. FIRST NAMES (M)*
    Practice exercise on a series of first names connected with last
    names, then test choosing the right first name for given last
    name from seven alternatives.

* The letters indicate the primary ability involved in the test.


FIG. 22. CHICAGO TESTS OF PRIMARY MENTAL ABILITIES.
SYNOPTIC OUTLINE

rating, whether it be of the nature of a mental age, a percentile,
or a standard score. The hypothesis is that we are dealing with
fundamental psychological components and processes which are
to be found in many aspects of mental life and behavior. The
individual is rated on the pattern or profile of these basic components
which the test reveals in the make-up of his mind. Just
how certain it is possible to be that this particular list contains
such authentic component elements of mentality is a question
that will be postponed until the technique of factor analysis by
which it was obtained comes up for consideration. For the moment
it is sufficient to remark that Thurstone obtained what he calls
primary abilities from an analysis of test performance rather than
from a direct experimental analysis of human behavior. This at
least raises some doubt.

As a practicable battery the instrument has been found somewhat
disappointing, in spite of defenses based both on theory and
on its predictive efficiency (Crawford 1940, Shanner). It is very
long, though this would hardly matter if it proved of superior
validity and worth. The tests are closely timed, and this puts a
premium on speed. Yet speed of response is not recognized as one
of the primary factors revealed. The profiles are difficult for even
an expert to interpret, and in effect unintelligible to the layman
(Stalnaker, 1939, 1940). Thurstone’s position, in the past at any
rate, has been that the primary abilities are independent of one
another, yet the following correlations have been obtained: perceptual
ability with number ability, .50; perceptual ability with
verbal ability, .61; perceptual ability with spatial visualization, .49
(Crawford, 1940). They certainly do not suggest independence
among the variables.

As to effective discriminative and prognostic value, in this it
has not so far been proved superior to many more familiar well-constructed
tests. Goodman (1944), summarizing the literature
up to that date, concludes that it predicts college success as well
as most other intelligence tests, though it takes longer than many.
One of the more favorable studies is that of Ellison and Edgerton
(q.v.), who report the highest subtest correlations with point-hour
ratio for college students as .44 for the verbal factor, the next as
.31 for the memory factor, and the multiple correlation for all
factors as .64. Or again, the data in Table 22 do not demonstrate
superiority. In addition to this material, Bernreuter and Goodman
(q.v.) find a multiple correlation of .49 between the battery
and semester marks for 170 freshmen engineering students, and
subtest correlations running from .04 to .38.
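
A multiple correlation can exceed every single subtest correlation because each subtest may contribute something the others do not. The sketch below uses the standard two-predictor formula with the verbal (.44) and memory (.31) coefficients reported by Ellison and Edgerton. The correlation between the two subtests is not given in the text, so the values tried for it are purely hypothetical, and the result is an illustration of the principle, not a reconstruction of the published .64, which rests on all the factors.

```python
from math import sqrt


def multiple_r_two_predictors(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: correlation of each predictor with the criterion.
    r_12: correlation between the two predictors (must not be +1 or -1).
    """
    r_squared = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    return sqrt(r_squared)


R_VERBAL = 0.44  # reported in the text
R_MEMORY = 0.31  # reported in the text

# Hypothetical intercorrelations between the two subtests.
for r_12 in (0.0, 0.2, 0.4):
    combined = multiple_r_two_predictors(R_VERBAL, R_MEMORY, r_12)
    print(f"assumed r_12 = {r_12:.1f}   multiple R = {combined:.3f}")
```

Within the range tried here, the more the two subtests overlap, the less the second one adds to the first, which is one reason why intercorrelated "primary" abilities yield a smaller gain in prediction than independent ones would.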

2. SRA Primary Mental Abilities (Primary) *


This battery, published in 1946, and intended for children of 
ages from 5 to 7 years, contains tests for five primary mental 
abilities, namely, motor ability, perceptual speed, verbal meaning, 
space, and quantitative thinking. This involves six of the eight 


TABLE 22

CORRELATIONS BETWEEN PRIMARY MENTAL ABILITIES AND EDUCATIONAL
ACHIEVEMENT

(Bernreuter and Goodman)

                                 CORRELATION WITH ACHIEVEMENT IN VARIOUS FIELDS
ABILITY                  Semester   Chemistry   Drawing   English       Mathematics
                         Average                          composition
P. Perceptual ability      .04        .07         .00       .05           .04
N. Number ability          .32        ay         -.01       .26           .27
V. Verbal ability          .33        .32         .01       .44           .16
S. Spatial ability         .23        .19         I         „JII          f
M. Memory                  .10        .04         .11       .23          -.05
I. Induction               .34        .23         .18       .21           .29
D. Deduction               .38        .41         a         .21           .44
N V S I D (together)       .51
N V I D (together)         .49
N V M I D (together)       .49
N S I D (together)         .49

primary abilities so far clearly defined in Thurstone’s work, since
numerical ability and reasoning combine to manifest themselves
in quantitative ability in children of this age. The test items are
presented in graphic form. Thus verbal ability is measured by
tests in which the child indicates his understanding of word meanings,
sentence meanings, and paragraph meanings by indicating
appropriate pictures. Spatial ability is measured by items requiring
the child to select from a series of diagrams the one which will
complete an incomplete square, and by items requiring him to copy
in the element or elements omitted from an incomplete design from
the complete design which is presented. Motor ability, which is of
much importance at this age, is measured by drawing lines connecting
dots in parallel rows. The tests are designed for group
administration, the elapsed time required being about one hour.

* References: Thurstone and Thurstone, 1948; T. G. Thurstone.

The Primary battery is one of a series of similar batteries now in
preparation. The battery yields a total score, which is said to give
a measure of the child’s general learning ability, and also scores
on each of the five abilities involved. These scores can be converted
into mental ages and into quotients, and the abilities scores
yield a profile. It is believed that the profile has much more
significance both for diagnosis and guidance than the total score.
Thus the verbal-meaning and perception scores are regarded as
closely related to reading readiness, and the manual presents a
general discussion showing the advantages of profile ratings over
global scores.


3. California Test of Mental Maturity * 


This important battery is yet another instance of a test built
about sharply defined and delimited concepts. It has gone through
several revisions since its first publication in 1937. It is set up for
use at five levels, namely, preprimary, primary, elementary, intermediate,
and advanced. An important feature of the battery is
that it provides pretests for visual acuity, auditory acuity, and
motor coordination. A synoptic outline of the mental maturity
tests themselves, omitting the three pretests, is presented in
Figure 23.

The factors about which the battery is built in its present form
are immediate and delayed memory, spatial relationships, logical
reasoning, numerical reasoning, and vocabulary. The authors have
proceeded on the multi-factor assumption, i.e., that the significant
constituents of mentality consist of more or less separate primary
abilities, rather than of a general factor together with group and
special factors. Also they believe that global measures conceal
much about a subject’s mentality that is of great importance.

The battery yields profiles based on the indicated factors. Also
it yields three kinds of M.A.’s and three kinds of I.Q.’s, namely, a
language M.A. and I.Q., a nonlanguage M.A. and I.Q., and a
total M.A. and I.Q. (Traxler, 1937, 1939; Tiegs).

* References: Maxfield, 1937; Tiegs; Traxler, 1937; Traxler, 1939.


Test 4. MEMORY (immediate recall)
A series of word pairs given vocally, with a set of 3 pictured objects
corresponding to each word pair. Task is to identify object
corresponding to the first word in the pair.

Test 5. MEMORY (delayed recall)
A story or expository passage read aloud to subjects. This reading
comes at a late point in the battery, and after it other tests are run.
After the interval involved, multiple choice items on the material
are given.

Tests 6, 7, 8. SPATIAL RELATIONSHIPS
Tasks involving discrimination between right and left, the manipulation
and transposition of geometrical forms, maze problems, etc.

Tests 9, 10, 11, 15. LOGICAL REASONING
Tests 9, 10, and 11 are nonlanguage reasoning tests, presenting
tasks in graphic form. Test 15 is a verbal test of logical reasoning,
requiring the drawing of formal inferences, etc.

Tests 12, 13, 14. NUMERICAL REASONING
A variety of reasoning problems involving the use of numbers.

Test 16. VOCABULARY
Fifty 4-choice items calling for the interpretation of words.


FIG. 23. CALIFORNIA TEST OF MENTAL MATURITY.
SYNOPTIC OUTLINE


Reported reliabilities for all factors are in the nineties or high
eighties. Educational, business, and industrial uses of the battery
are indicated in the manual. Percentile age norms and percentile
norms for various populations are presented for the various
factors. Experience and the reports of investigation have not yet
accumulated sufficiently for any dependable evaluation to be possible.
However, certain serious questions arise. R. Cattell (1942 a)
remarks that while the authors believe that they have clarified
matters by avoiding the “mysterious” idea of general intelligence,
it is quite possible that they have introduced another mystery just
as great. Kuhlmann (1939-40) has expressed doubts as to the value
of labeling tests by the functions they measure or are supposed to
measure, because no one can tell what those functions are by inspection.
Moreover, the practice can easily lead to absurdities that
have their risky side; as, for instance, if one concludes that a
child is a “good reasoner” with a “poor memory.”


APPRAISAL OF INTELLIGENCE TESTS 


Having concluded our survey of intelligence tests, it is appropriate
to consider the question of appraisal and of standards and
techniques of evaluation.


1. Significant Trends in Intelligence Testing


What are the chief lines of development that have manifested
themselves in intelligence testing, since the inception of individual
testing in the work of Binet, and the formative work done in
group testing in World War I?

The advance in efficiency is undeniable. Greater ease in administration
has been achieved, notably by the adoption of the spiral
omnibus type of organization in the place of separate subtests.
This makes it possible to mass instructions at the beginning of
the test, together with some necessary practice, and does away
with the practically difficult problem of fine timing, which also
raises theoretical issues. The spiral omnibus organization has been
both defended and attacked. The argument in favor of it is that
it requires the subject to make frequent and rapid adjustments,
and that this is considered one of the important aspects of intelligence.
But this is probably farfetched. The argument against it
is that it makes for boredom and reduces motivation, which is
stepped up by a sequence of brief subtests effectively introduced.
There is no significant evidence either way, and the truth probably
is that the specific psychological effect of the device is
negligible.

As another factor of efficiency, too, scoring methods have been
improved. Stencil scoring and machine scoring have become widely
used. But it is possible that there is a genuine objection here. A
test set up for stencil scoring or machine scoring must greatly
restrict the responses of the subject. He must make a mark, or
underline a word, or write in a word or other symbol in the right
box, or perhaps punch a hole with a stylus. Items which can be
thrown into shape for such responses are bound to be restricted,
and the general effect is to make tests increasingly rigid and cut
to pattern for convenience.

Yet another factor of increased efficiency is the development of 
short tests, primarily for screening purposes. Of this the Wonder- 
lic Personnel Test and many of the new tests developed in the 
armed services are excellent instances, as are also the abbrevia- 
tions of the Wechsler-Bellevue scale by Gurvitz and Rabin. This 
trend is significant and widespread (v. Conrad), and may be said 
to reach a high point in the single-item tests described by H. M. 
Hildreth (q.v.). These single-item tests, which of course are still 
more brief and compact than the more usual adaptations of exist- 
ing instruments, were used at a naval training station for rapid 
mental screening, the purpose being not comprehensive measure-
ment, but simply an assurance that subjects did not fall below a
certain minimum level of ability. In one such test the problem was
to give the products of 7 times 7, 8 times 8, 9 times 9, 10 times 10,
11 times 11, and 12 times 12, and a passing rating was made on
not more than two errors. Another consisted of the following two
questions: “Why does the Moon look bigger than the stars? What
time of day is your shadow shortest?” Full credit was given if 
both were correct. 

There has also been an advance in general statistical adequacy.
As to genuine psychological or even statistical improvements, 
the showing is not nearly so favorable. With the best tests, such 
as those discussed above, standardization is careful. As to its 
adequacy, it is usually based on quite large groups, though the 
selection and composition of such groups that have the vital func- 
tion of acting as true samples of mental ability may raise some 
questions. Perhaps purchasers and users of tests have become suf- 
ficiently sophisticated to be impressed by standardization groups 
running into the thousands. Highly dubious interpretations, such 
as the derivation of necessarily shaky grade norms, are on the
whole avoided, though some peculiarities can be found, such as
the derivation of the Otis intelligence quotient. However, the
standardization pattern established more than twenty years ago
is still being followed substantially, whatever may be its psycho- 
logical worth. 

A problem connected with standardization that has not been
solved, and indeed hardly attacked in most test construction, turns
on the relative difficulty of the items. If many of the items in
subtests, or in a whole test organized on the spiral omnibus plan,
are of about the same level of difficulty, the score means one thing.
If they are in ascending order of difficulty, the score means some-
thing very different. Quite insufficient attention has been paid to
this point, even in the best commercial tests, in spite of the stand-
ing monition contained in the work of Thorndike on the I.E.R.
Intelligence Scale CAVD.

Item selection throughout the period has on the whole been
governed by essentially the same considerations: correlation with
the test as a whole, power to discriminate between subjects
“known” to be bright or dull on some external criterion which is
often rather vague, and “opinions of experts.” The idea of select-
ing items for the building of subtests which will correlate high
with total scores and low with one another, which was at least
recognized as an ideal in the Army work, has on the whole been
disregarded except in principle; and, of course, the adoption of
the spiral omnibus setup undercuts the whole issue. Item content
impresses one by its extreme uniformity. The same items appear
again and again with minor variations, relieved here and there by
a few ingenious novelties of which the psychological significance,
if any, is unknown.

So far as this last statement is concerned, the one great qualifi-
cation is the increasing application of the techniques of factor
analysis. When it is proposed to construct a battery of tests whose
component subtests are factorially pure, i.e., measure one and
only one definable mental factor, a new and distinctive method
of item selection at once appears. The application of factor theory
is, indeed, the outstanding psychological development in this
whole field, as contrasted with increases in efficiency and im-
provements of already familiar techniques. Kornhauser (1945 a)
reports a very decided trend of opinion among psychologists in
favor of profile scores as contrasted with global scores such as
percentiles, M.A.’s and I.Q.’s. It is, however, too early as yet to
say with confidence how much this means in the way of a tangible
improvement of tests.

However, there cannot be the least doubt that the best tests,
properly used and conservatively interpreted, are extremely serv-
iceable instruments. On the basis of experience as well as of formal
investigation, they have proved their value as practical tools of
guidance. This, as Wechsler and others have pointed out, is the
ultimate validation. The psychologists whose views were polled
by Kornhauser (1945 a) were asked to rate intelligence tests as
meeting practical needs (a) in the Army, (b) in schools, and (c) in 
business. Consolidating all responses which gave ratings of ex- 
tremely well, rather well, or better than without tests, the per- 
centages of votes were (a) 88%, (b) 67%, and (c) 67%. Durflinger 
(q.v.), in a limited but rather interesting study, finds that median 
correlations between intelligence scores and college marks have 
risen from .45 in 1934 to .52 in 1943, and suggests that this may 
indicate an improvement in testing, among other possibilities. Of 
course such figures are very far indeed from being decisive, and 
if there has been a general over-all improvement, it is probably 
quite slight. Indeed, it is possible to ask whether our present-day 
tests, in spite of their greater convenience and efficiency, actually 
predict more surely and significantly than those of twenty years
ago, which is certainly a limiting factor on their practical valida-
tion.


2. Agreement and disagreement among tests 


A. A very considerable but by no means perfect agreement 
among the results of different verbal tests when applied to the 
same subjects has been reported in numerous studies. Thus Guiler 
(1921-22) reports correlations of .85 between the Stanford-Binet 
and the Illinois Intelligence Examination, of .75 between the 
Stanford-Binet and the National Intelligence Tests, and of .8% 
between the National Intelligence Tests and the Illinois Examina- 
tion for a group of subjects in grades 6, 7, and 8. Then, some 
twenty-five years later we have such reports as that of Sartain 
(q.v.), dealing with the Revised Stanford-Binet, the Wechsler-
Bellevue, and group intelligence tests, and of Traxler (1945), deal-
ing with the Otis Self-Administering Test and the American
Council on Education Psychological Examination, and giving
closely comparable figures. These are representative results, and
although a very great many more of about the same order, with
only minor variations, have appeared, it hardly seems worth while
to labor the point by reproducing them. What such results seem to
indicate is that verbal intelligence tests have been rather con-
sistently built about a similar concept translated into similar
items.

B. However, it must be remembered that such correlations rep-
resent central tendencies or mean trends. Such mean relationships
may be quite high, and still there may be much variability. This,
in fact, is the case. Thus P. Cattell (1930) finds variations in
individual performance as between different tests specially marked
in upper extreme cases. As between the Otis Self-Administering
Test and the Stanford-Binet, the mean difference at I.Q. 70 was
only .4, but at I.Q. 130 it was 15.8. High correlations, it must be
recalled again, mean stable relative standing, not equal scores.
Gates (1923), again, found that I.Q.’s for the same pupil on six
standard intelligence tests may range from 104 to 144, while mean
class I.Q.’s may range from 109 to 129. Similarly, Miller (q.v.)
gave ten intelligence tests to a group of 57 university freshmen, and
obtained I.Q.’s ranging from 117.5 on the Stanford-Binet to 138.5
on the Miller Mental Ability Test for the same individual. The
clear conclusion is to beware of making absolute ratings on any
single test, although the relative standings it reveals may be
significant enough.
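
The point that a high correlation means stable relative standing
and not equal scores can be checked with a small numerical sketch.
The five I.Q.’s below are hypothetical, invented for illustration
only (they are not Cattell’s data). The second test is constructed
to agree with the first at I.Q. 70 and to read 15 points higher at
I.Q. 130, and the helper name pearson is likewise an assumption of
this sketch.

```python
from statistics import mean


def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical I.Q.'s of five pupils on test A.
test_a = [70, 85, 100, 115, 130]

# Hypothetical test B: identical at I.Q. 70, progressively higher above it.
test_b = [a + 0.25 * (a - 70) for a in test_a]

r = pearson(test_a, test_b)
gaps = [b - a for a, b in zip(test_a, test_b)]

print("correlation:", round(r, 3))
print("difference (B minus A) for each pupil:", gaps)
```

The rank order of the pupils is identical on the two tests, so the
correlation is perfect, yet the pupil at the top of the range
receives a rating 15 points higher on one test than on the other.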

C. Such differences in the rating of a given individual or group 
on different tests are accentuated and become a matter of constant 
expectation when comparisons are made between standings on 
verbal and performance tests. Thus Seagoe (q.v.) found that a 
test like the Pintner-Cunningham Primary is likely to yield I.Q.’s 
from 5 to 11 points higher consistently than such tests as the
Terman Group Test of Mental Ability or the National Intelligence
Tests. So, too, the Pintner-Paterson Scale of Performance Tests
and the Arthur Scale of Performance Tests yield ratings consist-
ently higher than those dependent on standard verbal tests. Corre-
lations between performance type and verbal type tests are signifi-
cantly lower than those between various good verbal intelligence
tests. Thus it seems clear that a different conception of intelligence
is involved, different perhaps not so much in the words in which
it might be framed as in the items into which it is translated.

D. There seems little doubt that the main reason for variations
among tests is that their norms are based on different standardiza-
tion groups. Various attempts have been made to overcome this.
Thus Kefauver (q.v.) gave several tests to the same group and
worked out standard scores based on its performance in them.
Steckel (q.v.) restandardized the Kuhlmann-Anderson test, the
Otis Group Test of Intelligence, Intermediate Examination, and
the Otis Group Test of Intelligence, Advanced Examination, on
10,799 children in the schools of Sioux City, Iowa, who constitute a
reasonably homogeneous population, and derived percentile rank-
ings of I.Q.’s which make inter-test comparisons possible. Cole
(q.v.), again, has worked out and reported elaborate conversion
tables which equate the scores made on the Terman Group Test
of Intelligence, the Otis Group Test of Intelligence, Advanced
Examination, and the Otis Self-Administering Test of Mental
Ability. Obviously, one of the most crucial of all points of test
construction is involved; namely, the use of a standardization
group as a true and sufficient sample from which general conclu-
sions can be drawn. This cannot help but make difficulties, and
there is no way of avoiding them, for even the three comparable
restandardizations just described are themselves based on some
specific standardization group (see also Runnels).
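
The idea behind percentile rankings that make inter-test
comparisons possible can be sketched briefly. What follows is a
minimal illustration of equating by percentile rank within one
common group, not a reconstruction of the procedure of Steckel or
Cole. The ten pairs of scores and the function names
percentile_rank and equate are hypothetical.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(scores, x):
    """Percent of the group scoring below x, counting half of any ties at x."""
    ordered = sorted(scores)
    below = bisect_left(ordered, x)
    ties = bisect_right(ordered, x) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)


def equate(x, scores_a, scores_b):
    """Score on test B whose percentile rank is nearest that of x on test A."""
    target = percentile_rank(scores_a, x)
    candidates = sorted(set(scores_b))
    return min(candidates,
               key=lambda y: abs(percentile_rank(scores_b, y) - target))


# Hypothetical I.Q.'s of the same ten children on two different tests.
test_a = [88, 92, 95, 99, 100, 104, 108, 111, 117, 125]
test_b = [95, 101, 103, 108, 112, 115, 121, 126, 133, 140]

print(percentile_rank(test_a, 100))   # standing of I.Q. 100 on test A
print(equate(100, test_a, test_b))    # equivalent standing on test B
```

As the text observes, any table produced in this way is only as good
as the group on which it is based: a different group of children
would yield a different conversion.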


3. Comparative evaluations of tests 


Some attempts have been made at relative appraisals of various 
tests—to decide which are better and which are worse—but with- 
out much broad success. 

A. There have been comparisons between various group tests 
and the Stanford Revision of the Binet scale, with varying and
ambiguous outcomes so far as the relative merit of the tests com-
pared is concerned. Thus Turney and Fee (q.v.) rank the Otis
Self-Administering Test (Intermediate), the Terman Group Test,
the National Intelligence Tests, and Haggerty Delta 2 in the
stated descending order in terms of their agreement with Stan-
ford-Binet I.Q.’s for the same group of subjects. The differences,
however, are not great. Nor is it clear why the Stanford-Binet scale
should be accepted as a norm of excellence.

B. Another attempted criterion has been the intercorrelation
of a test with a battery of other similar tests. Again, Turney and
Fee report the Otis Self-Administering Test (Intermediate), Hag-
gerty Delta 2, National Intelligence Tests, Terman Group Test, and
McCall Multi-Mental Scale in the stated descending order in
terms of mean intercorrelations with the whole battery. But the
coefficients are .761, .756, .755, .753, and .695, so that the true
differences in relationship are trivial. Also, the point arises that
if we had a genuinely and markedly superior test it would pre-
sumably not correlate well with other and inferior instruments.

C. Yet another criterion occasionally used is the distribution
of scores yielded by a test. It is commonly held that a small or
relatively limited spread of scores is preferable to a large one.
As a universal proposition, however, this is open to considerable
question. Kuhlmann (1939), as we have seen, prefers a wide dis-
tribution of scores, and very justly remarks that there is no reason
to suppose that a test which yields it is not revealing the true
facts.

D. Another criterion has been the relationship of the scores on 
a given test to some external standard, nearly always school or 
college achievement. Thus Jordan (q.v.) reports the correlations
between certain test scores and high school marks as follows: 
Army Alpha, .38 to .41; Otis Group Test, average .66 with a range 
of .33 to .o1; Terman Group Test, average .47 with a range of .30 
to .67. Guiler (1927) reports the Terman Group Test, the Otis 
Self-Administering Test, and the Ohio State Examination as cor- 
relating with college grades respectively .52, .49, and .47. It is 
noteworthy that correlations between intelligence scores and scores 
on such a broad instrument as the Stanford Achievement Test 
run higher than those between intelligence scores and marks, the 
obvious reason being that marks are rather low in reliability. In 
general, the wide spread of obtained correlations and the con- 
flicting mean correlations reported by various studies make test 
evaluation on this basis impossible. The nearest approach to some- 
thing significant here is the finding reported by Gates and La Salle 
(q.v.) that Stanford-Binet ratings maintain about the same rela- 
tionship to educational achievement over increasing intervals of 
time, whereas the relationship of ratings on all the other tests they 
studied drops with the passage of time.

E. Kuhlmann (1928) has approached the question in a some- 
what different way. He compared seven standard intelligence tests 
with the Kuhlmann-Anderson Intelligence Tests in terms of their 
discriminative capacity. The assumption was that a good test 
should show marked advances with age and grade, and the less
overlap the better. He was able to show that the Kuhlmann-
Anderson test was markedly superior in these respects. This is
perhaps the most impressive comparative study published.

So one cannot make confident general statements about the 
superiority or inferiority of the group tests we have been con-
sidering. Of course, it is possible to criticize methods of con-
struction, standardization, etc., but this is a different matter from
direct comparative ratings. Also, the story is different if we have
special purposes in mind. Army Alpha and the Otis Quick-Scoring
Tests are too easy for college populations. The Kuhlmann Tests 
of Mental Development and the Kuhlmann-Binet scale are better 
as diagnostic instruments than the Stanford-Binet scale, and so
is the Wechsler-Bellevue scale. The latter has an outstanding
superiority for use with adults (v. Rapaport and Others). A test, 
in short, is a practical instrument, to be used with specific groups 
with specific purposes in mind, not an instrument capable of 
absolute measurement. And in general it would seem that all care- 
fully and competently constructed group tests are on about the 
same level of excellence. 


4. What group intelligence tests measure 


That group intelligence tests, and more particularly those of 
verbal type, are committed to a translation of general intelligence 
substantially into terms of academic ability, has long been fairly 
clear. The material they use is drawn largely from curricular 
sources, and performance is undoubtedly influenced by learning 
similar tasks in school. Thus Bishop (q.v.) set up 4 groups, 2 of 
them from grades 7 and 8 and designated as A and B, 2 of them 
from grades 9 and 10 and designated C and D. These groups were
equated in intelligence on the Otis Group Test. Lessons were
devised to parallel the ten pages of the test, not containing the
same material but similar in principle. Group A was taught the
first 5 lessons, group B the second 5, group C all 10, and group D
none. When the study period was over, they were retested. Group
A gained 40% on the first half of the test and 6% on the second 
half. Group B gained 31% on the second half and 15% on the 
first. Group C made an over-all gain of 30%. Group D gained in 
all 11%. In other words, parallel though not identical teaching 
greatly affected test performance. And there is no question but 
that the curriculum contains much material which at least in a
general way parallels the content of group mental tests. More-
over, one should notice the very strong tendency to use school
groups both for standardization and for validation. So the tests
are certainly closely related to school work and achievement.
However, as previously argued, this does not mean that they do
not reveal mentality but only special aptitude, unless one is pre-
pared to say that succeeding in school calls for a special and lim-
ited talent.

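
The gains reported from Bishop’s study, cited earlier in this
section, are easier to weigh when set side by side. The sketch below
only rearranges the percentages given in the text. Treating the
untaught half of the test, or the untaught group D, as a rough
baseline for ordinary retest gain is an assumption made here for
illustration and is not a computation attributed to Bishop.

```python
# Percentage gains on retest as given in the text for Bishop's groups.
half_gains = {
    "A": {"taught_half": 40, "untaught_half": 6},
    "B": {"taught_half": 31, "untaught_half": 15},
}
overall_gains = {"C": 30, "D": 11}  # C was taught all ten lessons, D none

# Assumption: the untaught half serves as a within-group baseline.
excess = {
    group: g["taught_half"] - g["untaught_half"]
    for group, g in half_gains.items()
}

# Assumption: group D serves as the baseline for group C.
c_minus_d = overall_gains["C"] - overall_gains["D"]

for group, points in excess.items():
    print(f"Group {group}: taught half exceeds untaught half by {points} points")
print(f"Group C exceeds group D by {c_minus_d} points")
```

On either baseline the taught material shows a clearly larger gain
than the untaught, which is the sense in which parallel teaching
affected test performance.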
SUGGESTED ADDITIONAL READINGS 


For additional reading and more intensive study of the material in
this chapter the most important sources are the tests and more par-
ticularly the manuals of the tests discussed. Publishers will be found
listed in the bibliography of tests at the end of the book. Also the
references mentioned in the text in connection with the various tests
may be consulted. Further readings and suggestions are as follows:

J. E. Anderson, “The limitations of infant and preschool tests in
the measurement of intelligence,” Journal of psychology, 8 (1939),
351-79. A critical review of such tests and a broad discussion of their
limitations.

Beth L. Wellman, The intelligence of preschool children as meas-
ured by the Merrill-Palmer Scale of Performance Tests (University
of Iowa Studies in Child Welfare, vol. 15, no. 3, 1938). To a con-
siderable extent a “test of the test.” See pp. 144-48.

T. W. Richards, “The relationship of psychological tests in the
first grade to school progress; a follow-up study,” Psychological
clinic, 21 (1932), 137-71; also “Psychological tests in the first
grade,” Psychological clinic, 21 (1932), 235-42. The second article
is a continuation of the first. A general survey of a number of first
grade tests.

David Wechsler, The measurement of intelligence (Baltimore: The
Williams and Wilkins Company, 1944), Chapter 2, “The need for an
adult intelligence test.”

Oscar Krisen Buros (editor), The 1938 mental measurements year-
book (New Brunswick, N. J.: Rutgers University Press, 1938); also
The 1940 mental measurements yearbook (Highland Park, N. J.:
The Mental Measurements Yearbook, 1941). These valuable refer-
ence works should be consulted for test reviews and bibliographies.


QUESTIONS FOR DISCUSSION


1. Examine the statistical data here presented on the values of
tests for early childhood, supplementing it if possible with data from
the references, and formulate the general conclusions that seem indi-
cated, citing specific material.

2. Compare the status and stability of Merrill-Palmer I.Q.’s with
that of Stanford-Binet I.Q.’s and Wechsler I.Q.’s.

3. Discuss the relative importance of the factors tending to lower
the stability and prognostic value of tests for early childhood men-
tality.

4. Examine the item content of the tests here presented in synoptic
outline. To what extent do the items seem indicative of intelligence?
What other factors might they indicate?

5. Would a child’s reactions to a test situation, e.g., negativism,
be themselves significant? If so, of what?

6. Does it appear to you significant that as soon as tests go either
above or below the school ages their efficiency seems to be lower?
Consider some reasons for such a phenomenon.

7. Might one properly argue in the light of the data and discus-
sions here that intelligence tests, even at the best, do not measure
intelligence at all? If so, what might they really measure?

8. To what extent do Stoddard’s recommendations for an adult
testing program seem to you adequate? Would you supplement them
in any way?

9. Might some such program be used for personnel work? For other
purposes? Specify and discuss.

10. What peculiar dangers might result from a wholesale adoption
of profile ratings? Might this lead the whole testing movement into
disaster?


CHAPTER VII 


APTITUDE TESTING 
THE BASIC CONCEPT


An immense amount of work has been done in the huge and 
loosely bounded area of aptitude testing. This work has gained 
in general significance and to some extent in actual scope due to
the increasing dissatisfaction with the global ratings obtained by
intelligence tests. Here, as always, the basic problem is the search
for and the practical definition of the conception of the aptitude 
to be measured. 

The term aptitude has been defined many times. Two such 
definitions are here cited. According to Bingham (q.v., p. 18),
“aptitude, then, is a condition symptomatic of a person’s general
fitness, of which one aspect is his readiness to acquire proficiency
(his general ability) and another is his readiness to develop an
interest in exercising that ability.” According to Freeman (1939,
p. 182), “an aptitude is the ability or collection of abilities re-
quired to perform a specified practical activity.” Freeman points
out that an aptitude is not to be thought of as necessarily innate,
which is an important systematic caution. But on the other hand,
it is not a direct product of special training. Thus an aptitude for
machine design is not the product of training in machine de-
sign, but the ability, among other things, to profit by such
training.

Certainly we must avoid thinking of aptitudes as faculties, or 
unitary mental entities. Rather, they must be considered as dy- 
namic trends of the whole personality. 

All this is certainly vague enough, and the various characteriza-
tions take in a great deal of territory. Indeed, the boundary lines
are anything but clear. Words such as aptitude, talent, special
ability, trait, and so forth, are constantly used in overlapping
senses, and the differences between them are hazy. It is the present
writer’s decided opinion that attempts at pedantic clarity of defi-
nition here do very little good, and may easily do harm. Neat
classifications of existing instruments of measurement into apti-
tude tests, talent tests, tests of special ability, and so forth, really
tell very little and only exercise the mind to small purpose. Con-
fusion, in all conscience, is quite great enough without adding to
it by a scholastic concern for the precise meaning of words. This
is why, in the treatment here presented, a great variety of tests
are lumped together under the single broad classification of apti-
tude. The general idea is that there is something in the mental
organization that makes one good at clerical work, or mechanical
work, or science, or mathematics, or administrative or military
pursuits, or art, or music. Whether one wants to call this some-
thing an aptitude, or a talent, or a trait, or a special ability, or
something else, is not of much moment. What is actually before
us is a wide diversified endeavor to reduce these “somethings” to
terms of practicable measurement and prognostication.

In general, there are two methods of doing this, though most 
workers with a job of test construction on their hands may not 
confine themselves to one and exclude the other. (a) The job or
practical function in which the aptitude expresses itself may be
analyzed to single out its psychological components, perhaps by
factor analysis, perhaps by simpler means, and then test items
are organized to reveal these components. The extreme application
of this procedure is found in the so-called work sample test, made
up of actual samples of the activities involved in the job, or
activities very closely analogous to them. (b) A much more general
psychological analysis of the ability or aptitude in question may 
be undertaken, and this again translated into test items. 

Good examples of both procedures will be found in this chapter.
Instances of the former are the Minnesota Test for Clerical Work-
ers and the Orleans-Solomon Latin Prognosis Test, which are
made up of tasks closely similar to those in clerical occupations
and the learning of Latin. An outstanding instance of the latter
is the Seashore Measures of Musical Talent, which turn entirely
on a psychological conception of musical ability, and contain no
items from musical activity itself.


MEASURES OF MOTOR ABILITY

Numerous attempts have been made to single out motor ability,
which, be it noted, is a much more restricted function than
mechanical aptitude, as a definable and measurable aptitude.
Typical instances of the tests that have resulted are as
follows.


1. Finger Dexterity Test *

The test is intended to measure moderately fine digital adjust- 
ment and control. It consists of a modified peg board. The material 
is a large metal plate with 100 holes, and also 300 metal pegs one
inch long, which fit three each into the holes. They are to be
picked up three at a time and placed in the holes till all are filled.
The score is the time taken. This test has had a fairly wide use. 
Its reliability, validity, and general significance are none too clear. 
The norms worked out and reported by O’Connor are vague. Still, 
it has been found serviceable in testing small motor reactions. 


2. Tweezer Dexterity Test †

The task is to use a pair of tweezers to place small metal pins,
one by one, in 100 holes in a metal plate. It is scored on time in
seconds, i.e., the number of seconds between placing the first and
last pins. It has a satisfactory reliability. Norms are worked out
and reported. Thus on a standardization group of men, the score
of 255 produces a standard score of 7.5, a percentile score of
99.4, and a letter grade of A. A score of 615, indicating much
slower performance, corresponds to a standard score of 2.5, a
percentile score of 0.6, and a letter grade of E-.
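
The relation between the standard scores and the percentile scores
just quoted can be illustrated with a short calculation. The sketch
assumes, and the text does not state, that the standard-score scale
has a mean of 5 and a standard deviation of 1 and that scores are
normally distributed. The quoted figures are consistent with that
assumption, but it remains an assumption. The function name is
hypothetical.

```python
from math import erf, sqrt


def percentile_from_standard_score(score, scale_mean=5.0, scale_sd=1.0):
    """Percentile under a normal curve for a score on an assumed scale.

    scale_mean and scale_sd are assumptions of this sketch, not values
    given in the text.
    """
    z = (score - scale_mean) / scale_sd
    return 100.0 * 0.5 * (1.0 + erf(z / sqrt(2.0)))


for standard_score in (7.5, 2.5):
    pct = percentile_from_standard_score(standard_score)
    print(standard_score, round(pct, 1))
```

Under these assumptions a standard score of 7.5 lies two and a half
standard deviations above the mean and a standard score of 2.5 the
same distance below it, which gives percentiles that round to the
99.4 and 0.6 quoted above.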


3. Minnesota Rate of Manipulation Test ‡

The material consists of a long board with 60 round holes in
4 rows, 15 holes to a row, and the same number of cylindrical
blocks of diameter one-sixteenth inch less than that of the holes.
The placing test consists of putting the blocks into the holes with
one hand. The turning test consists of taking them out with one
hand, turning them over, and replacing them with the other hand.
The score is on the time required.

The test is useful for occupations and activities that require
speed of gross movement, which it measures: package wrappers,
packet stuffers, assembly-line workers, possibly typists.


4. Stanford Motor Skills Test §

This battery consists of 6 serial dexterity tests, selected from
among 20 tentatively considered, for the following reasons. (a)
Adaptation to school and factory use. (b) Economy of time, the
total time required being less than 2 hours. (c) Compactness and
practicability of the needed material. (d) Automatic scorability.
(e) High retest correlations. The retest coefficients reported run
from .75 to .86. (f) Low correlations with the Thorndike Intelli-
gence Examination for High School Graduates. (g) Low correla-
tions with special training in motor skills, e.g., in typewriting,
playing musical instruments, and athletics. (h) Low intercorrela-
tions among subtests, the mean being .25.

* References: O’Connor, 1928, 1938; Bingham; Green and Berman.
† References: O’Connor, 1928, 1938; Bingham; Green and Berman.
‡ References: Bingham; Green and Berman.
§ Reference: Robert Seashore, 1926, 1928.

The subtests of the battery are as follows. (1) Koerth Pursuit
Test. The subject holds a metal stylus on a metal disk one-half
inch in diameter mounted on a phonograph turntable set for one
revolution per second. Thus the disk follows a circular path. The
score is the distance through which contact is maintained for 20
consecutive seconds. There are 10 trials allowed. (2) Seashore
Motor Rhythm Test (v. R. Seashore, 1926). This test requires
the subject to tap out various rhythmic patterns dictated in taps,
using a stylus with electric contact which records the result. It is
scored on number of successful reproductions. (3) Tapping Speed.
The subject presses and releases a telegraph key as fast as he
can for a period of 5 seconds, a record being made of the result.
Two trials are given. (4) Serial Discrimination. Four numbers are
exposed (1, 2, 3, 4), and the subject responds by pressing the
appropriate key out of 4 before him. The numbers are exposed
visually in random order. The score is the number of correct
responses in 2 minutes. (5) Brown Spool-Packer Test. This con-
sists of packing spools in a small box, using both hands. The score
is the number packed in 3 minutes. (6) Miles Drill Test. Consists
of rotating the handle of a small drill as fast as possible for 10
seconds. Score is number of rotations. There are 3 trials.

With all these and similar tests the great difficulty is the narrow
range of their validity. They do not reveal or measure a general
aptitude or factor of general motor efficiency. In all probability
there is none. Thus Perrin (q.v.) in a crucial experiment gave a
battery consisting of 11 simple and 3 complex motor tests to about
50 subjects. The simple tests dealt with tapping, reaction time,
and so forth. The complex tests were as follows. (1) Bogardus
Fatigue Test, i.e., placing a block correctly again and again on a
rotating platform. (2) Sorting cards into holes or piles according
to marks on the cards. (3) Motor coordination, the task being to
draw a square with one hand and a triangle with the other simul-
taneously. He found virtually no intercorrelations and concluded
that there is no such thing as general motor ability. The attempt
at rebuttal and defense made by Garfiel (q.v.) is not convincing,
for what she showed was that ratings by judges on the general
motor ability of subjects show high correlations, which seems to
demonstrate something about the rating rather than the ability
itself. Buxton (q.v.), again, ran 9 motor tests with 76 boys, the
tests including 2 for motor steadiness, 3 of tapping speed, packing
spools, packing cubes, rotor mobility, and rotor pursuit. A
factorial analysis revealed no general component of motor
ability.
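
The kind of evidence on which such conclusions rest, namely the
average correlation among the tests of a battery, can be sketched as
follows. The three sets of scores are hypothetical and deliberately
constructed so that every pair is uncorrelated. They are not the
data of Perrin or Buxton, and the function names are assumptions of
this sketch.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mean_intercorrelation(battery):
    """Average correlation over all pairs of tests in the battery."""
    pairs = combinations(battery, 2)
    return mean(pearson(battery[a], battery[b]) for a, b in pairs)


# Hypothetical scores of the same four subjects on three motor tests,
# built so that no test predicts any other.
battery = {
    "tapping": [-3, -1, 1, 3],
    "steadiness": [1, -1, -1, 1],
    "pursuit": [-1, 3, -3, 1],
}

print(mean_intercorrelation(battery))
```

A mean near zero, as here, is what the absence of a general factor
would be expected to produce. A battery saturated with a common
factor would show substantial positive correlations throughout. The
mean of .25 reported for the Stanford battery lies much nearer the
former case.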

Of course, it seems very reasonable, and also very inviting, to 
believe that human beings can be classified in terms of a general 
motor efficiency manifesting itself in all their doings. If this were
possible, it would be very useful. However, our existing psycho-
metric instruments do not make it possible, and there is reason to
doubt even its theoretical possibility. The tests we have seem to
possess little significance beyond themselves and their immediate
and obvious applications. This is even true, in all probability,
of the motor rhythm test, although “the importance of being
rhythmic” is to many persons a golden thought. Jersild and
Bienstock (q.v.), using a motion-picture technique, made very
accurate measurements of children’s clapping and stepping to
music. Quite possibly this might be more revealing than the
Seashore Motor Rhythm Test, which calls for the reproduction of
rhythms in clicks, since the function seems more meaningful. But
the correlation between rhythmic efficiency and singing was only
.30. Motor tests, then, are serviceable for judging promise in con-
nection with activities, jobs, and functions with which they are
directly related. But they do not seem to reveal any such factor
as general motor efficiency. This may be because of the limitations
of the tests, or because no such factor exists.


TESTS OF MECHANICAL APTITUDE


Mechanical aptitude refers to a higher level of organization 
than manual aptitude. It involves not merely dexterity, but the 
ability to understand and solve problems involving mechanical 
relationships and arrangement, such as those which occur in the 
adjustment, repair, and assembly of machinery. The tests that 
have been developed in this field belong to two classes. (a) Those
which call for the actual manipulation and assembling of mechanical
objects. (b) Paper-and-pencil tests, calling for a knowledge
of parts of machines, an understanding of mechanical relationships,
and so forth. A good many of the items used in all such tests
have some resemblance to those appearing in performance tests.
Typical examples of both kinds are given below.


1. Stenquist Assembly Test of General Mechanical Ability 


The test consists of assembling two series each of 10 small
objects such as may be purchased in the local stores. The objects
are presented disassembled. The score is the number of articles
correctly assembled in 30 minutes. These two series, which are
shown in Figure 24, are intended for subjects from the 5th grade
level to adult. A third series has been added intended for subjects
from the 3rd to the 5th grade.


SERIES I                      SERIES II
 1. Cupboard catch             1. Sash fastener
 2. Chain                      2. Rope coupling
 3. Mousetrap                  3. Defiance paper clip
 4. Hunt paper clip            4. Expansion nut
 5. Bicycle bell               5. Double-action hinge
 6. Shutoff                    6. Calipers
 7. Lock no. 1                 7. Elbow catch
 8. Push button                8. Lock no. 2
 9. Clothespin                 9. Expansion rubber stopper
10. Wire stopper              10. Pistol


FIG. 24. ARTICLES TO BE ASSEMBLED: STENQUIST ASSEMBLY TEST
OF GENERAL MECHANICAL ABILITY


The test has been validated on the criterion of ratings in
mechanical aptitude by teachers of shopwork and science. Usually
classes were selected in which ratings could be secured from both
teachers. Obtained correlations are decidedly high, as will be seen
from Table 23.

The test appears virtually unrelated to general intelligence, as
shown by the correlations presented in Table 24. The only
coefficient of even moderate size is that obtained between the
Stenquist test and the intelligence score on Army Alpha of a group of
909 members of the Army Engineer Corps.

* References: Stenquist; Hunt.


TABLE 23


CORRELATIONS BETWEEN SCORES ON STENQUIST ASSEMBLY TEST OF
GENERAL MECHANICAL ABILITY, SERIES I, AND
RATINGS BY TEACHERS


(Stenquist)


7th and 8th grade boys, Lincoln School ................ .83
8th grade boys, New York public schools ............... .80
8th grade boys, New York public schools ............... .42
6th and 7th grade boys, Horace Mann School ............ [illegible]
6th grade boys, Horace Mann School .................... .90
6th grade boys, Horace Mann School .................... .88

Reliabilities as reported by Stenquist (odd-even) range from
.80 to .06. Clearly there is considerable likelihood of the incidence
of variable error. Paterson and his co-workers extended the test
by adding six more items, and obtained an odd-even reliability
of .94 (Paterson and Others, 1930).

Since the test shows low correlations with verbal intelligence,
and high correlation with such criteria as teacher ratings, the
conclusion is that it is a true aptitude test, and not a performance
test of intelligence.


TABLE 24


CORRELATIONS OF SCORES ON STENQUIST TEST OF GENERAL MECHANICAL
ABILITY WITH INTELLIGENCE AS REVEALED BY SCORES ON ARMY ALPHA


  76 unselected adults ................................ .300
  30 adults below score 50 on Alpha ................... 0
 216 adults, low-grade intelligence ................... 0
 909 adults, Engineer Corps ........................... .510
  30 feeble-minded .................................... .32
1007 children, 7th and 8th grade ...................... .2307


2. Minnesota Mechanical Assembly Test *


This is one of a battery of four tests developed under the 
direction of Paterson. The others are the Minnesota Spatial Rela- 
tions Test, the Minnesota Paper Form Board, and the Minnesota 
Interest Analysis Test. Of these three, the first and second are 
discussed below. 

The Mechanical Assembly Test here under consideration is an
extension and elaboration of the Stenquist Assembly Test of
General Mechanical Ability. It consists of 33 disassembled objects
similar to those used by Stenquist, which come in 3 boxes, A, B,
and C, the parts of the objects being in different compartments.
The complete assembling of each object involves a stated number
of “connections,” i.e., the bringing together of any two parts of
the object. A fixed time is allotted to each object, and scoring
depends on how much is done within the time limit. If the object
is completely assembled, a score of 10 is earned. If it is partially
assembled, the score is computed on the number of connections
actually made as related to the total number required for the job.
Thus the bottle stopper in box A requires 3 connections, and the
scores may be 0, 3, 6, 10. The spark plug in box B requires 5
connections, and the scores may be 0, 2, 4, 6, 8, 10.
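
The partial-credit rule just described can be sketched in a few lines.
The source gives only the two worked cases (3 connections and 5
connections); the rounding-down rule and the function name
`assembly_item_score` are assumptions chosen because they reproduce
both cases, not a statement of the published scoring key.

```python
def assembly_item_score(connections_made: int, connections_required: int) -> int:
    """Score one object on a 0-10 scale from the share of connections made.

    Assumption: fractional credit is rounded down, which reproduces the
    two examples in the text (0, 3, 6, 10 and 0, 2, 4, 6, 8, 10).
    """
    if connections_required <= 0:
        raise ValueError("connections_required must be positive")
    if not 0 <= connections_made <= connections_required:
        raise ValueError("connections_made must lie between 0 and the total")
    # Integer division rounds the proportional credit down.
    return (10 * connections_made) // connections_required


# Possible scores for the two objects mentioned in the text.
bottle_stopper = [assembly_item_score(k, 3) for k in range(4)]
spark_plug = [assembly_item_score(k, 5) for k in range(6)]

print(bottle_stopper)  # [0, 3, 6, 10]
print(spark_plug)      # [0, 2, 4, 6, 8, 10]
```

A total score for the test would then presumably be the sum of such
item scores over the objects attempted.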

This test, like the other three in the battery, was developed as
a result of long and elaborate research, the reports of which
constitute one of the fullest analyses of mechanical aptitude
available. A large number of tests and test items were tried out, and
a threefold validation criterion was developed. This consists of (a) a
measure of the quality of mechanical work done, i.e., the quality
criterion; (b) a measure of the quantity of mechanical work done,
in relation to quality, i.e., the quantity-quality criterion; (c) a
measure of information about tools and materials and their uses,
i.e., the information criterion. The Assembly test was found to
have a retest reliability of .94, and a validity against the criterion ...

Persons in mechanical occupations tend to show a superiority
on this test. It has very little relationship to general verbal
intelligence or to motor agility. Thus it was found that 70% of
mechanics did better than the average clerk, but only 11% of the
mechanics equalled the verbal intelligence score of the average
clerk.


* References: Paterson, Elliott, Anderson, Toops, and Heidbreder; Hunt.




3. Minnesota Spatial Relations Test * 


The test consists of 4 form boards, A, B, C, D, each with 58 
pieces to be placed. One set of blocks is used for boards A and B, 
and a second set for boards C and D. The scoring is on both time 
and errors, an error meaning an attempt to place a block in a 
wrong hole. A reliability of .84 and a validity against the three- 
fold criterion mentioned above of .53 is reported. The test has a 
low relationship to general verbal intelligence and to agility. Its 
validity for specific vocational forecasts is unknown. It evidently 
measures a component of mechanical ability, for 102 automobile 
mechanics were found to make a better mean score than 82% of 
an unselected population. 


4. Minnesota Paper Form Board †


A sample item from this test appears in Figure 12. The material
consists of sets of geometrical figures similar to the set there
shown. On the left side of the sheet is a large figure, and on the
right are smaller figures. The task is to draw lines in the larger
figure to show how the smaller ones can be fitted into it. There
are 2 series, A and B. Timing is 15 minutes for each series. The
score for each is the number of right solutions. A reliability of
.90 and a validity of .52 against the criterion used in all these
tests and described above has been reported.


5. I.E.R. Assembly Test for Girls ‡

This is an assembly test similar to the Stenquist and the
Minnesota, but using material more suitable for girls. There are 11
subtests as follows. (1) Stringing beads. (2) Inserting tape. (3)
Making rosette. (4) Cross-stitching. (5) Assembling key ring.
(6) Assembling clips. (7) Tape sewing. (8) Attaching trunk tag.
(9) Wrapping string around cards. (10) Assembling booklet.
(11) Cutting and trimming paper. In the short form worked out
by Metcalfe and Burr, subtests 1, 5, 8, and 10 have been
eliminated because of various inconveniences and inadequacies, leaving
7 subtests. The time limits are: for the complete form 45 minutes,
for the short form 25 minutes. Each form is scored on
adequacy of response to the subtests. The short form correlates
with the long form .93 for 1,300 cases.

The test has proved useful with factory jobs requiring routine
piecework of the general type included. Some vocational norms
have been worked out. A score of 50% on subtests 2 and 3 is said
to indicate fitness for simple assembly jobs; 50% on subtests 6
and 9 is said to indicate fitness for harder assembly jobs; 50%
on subtests 4 and 7 is said to indicate fitness for sewing (Metcalfe
and Burr).

* References as above.
† References as above.
‡ References: Burr and Metcalfe, 1936, 1937; E. B. Greene; Toops, 1923.


6. Hand Tool Dexterity Test (Bennett) 


The test consists of a wooden frame with two uprights, in one
of which are 12 bolts in 3 rows of 4 each, and in the other 12
corresponding holes. Three wrenches and a screwdriver are
provided. The task is to disassemble the bolts from one upright and
reassemble them in the other. The score is the time required. The
test measures proficiency with ordinary tools, which the author
regards as a combination of aptitude and achievement based on
experience. Brief practice does not seriously affect the scores. A
retest reliability of .91 is reported. The test has been found to
correlate .46 with foremen’s ratings. Percentile norms are given
for factory workers and for adults in a vocational guidance center.


7. Stenquist Mechanical Aptitude Test * 


Unlike those so far discussed, this is a paper-and-pencil test.
It is intended for use for grades 9 to 12. It consists of two parts.
The first requires the matching of mechanical objects, e.g., a
wrench to go with a spark plug as the proper tool to use. The
second calls for knowledge of the parts of machines and
mechanical objects.


8. O’Rourke Mechanical Aptitude Test †


This is another test of the same general type, i.e., calling for
paper-and-pencil responses rather than actual manipulation and
assembling. It consists of two parts. Part I is pictorial. It is made
up of a number of items, the task being to indicate which pictured
tool (screwdriver, wrench, brace, etc.) should be used on the
pictured object (screw, nut, bit, etc.). Also, there are items in which
the subject is to indicate which tool to use for a specified job,
such as the cutting of a thread. Part II is verbal. It consists of
60 questions calling for mechanical information, in multiple-choice
form. The basic assumption here is that continuing mechanical
interest will result in gaining and retaining mechanical
information.

The test was standardized on 9,000 men of ages from 15 to 24.
The percentile and standard score norms which are shown in
Table 25 were worked out on about 70,000 cases. It is said to
correlate from .64 to .84 with ratings by teachers of shopwork,
but the figures seem extremely high, and the validation process
is not well explained. The test was widely used by the Tennessee
Valley Authority.

* Reference: Stenquist.
† References: Bingham; Fryer (1931).
TABLE 25


RAW SCORES ON O’ROURKE MECHANICAL APTITUDE TEST WITH
CORRESPONDING STANDARD SCORES AND CENTILE SCORES


(From Bingham, Table 34, p. 319)


Raw Scores     Standard Scores     Centile Scores
   317              7.0                 97.7
   295              6.5                 93.3
   265              6.0                 84.1
   233              5.5                 69.1
   198              5.0                 50.0
   172              4.5                 30.9
   115              3.5                 ...


As is shown in the data here cited, and in the large amount of
additional material to be found in the reference sources, tests
such as these, of which many others exist, can be adequately
reliable. They are unrelated to general intelligence and to motor
agility. So far, then, mechanical aptitude seems to stand up as a
fairly well delimited concept. But factor analysis reveals that it
is certainly not a unitary component of human mentality and
behavior (v. Paterson, Elliott, Anderson, Toops, and Heidbreder;
Harrell). We have, therefore, another practical working concept,
which serves fairly well. The tests built around it, and
particularly the assembly tests, can serve useful purposes for
vocational and educational guidance where immediate choices and
placements are the issue. For long-time predictions, however, they are
open to the gravest doubt. Thus Thorndike (v. Thorndike,
Bregman, Metcalfe) made a follow-up study of 2,225 children who were
tested in the 8th grade for intelligence, clerical aptitudes, and 
mechanical adroitness on a battery of 14 tests. They were followed 
up for about 8 years, and the vocational success of those leaving 
school was studied. No success at all in predicting success in 
mechanical occupations was reported, although a large range of 
such occupations was included. The value of paper-and-pencil tests 
of mechanical aptitude is decidedly more doubtful than that of the 
assembly tests, in spite of the impressive validation coefficients 
reported by O’Rourke, which are open to considerable question. 
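
The legible rows of Table 25 suggest how the two derived scales are
related: each centile score matches the area under a normal curve
below a standard score whose mean is 5.0 and whose unit is one
standard deviation. That reading is an inference from the printed
figures, not something the source states, and the function name below
is hypothetical. A minimal sketch of the check:

```python
from statistics import NormalDist


def centile_from_standard_score(standard_score: float,
                                mean: float = 5.0,
                                sd: float = 1.0) -> float:
    """Percentage of a normal distribution falling below the score.

    Assumption: the standard-score scale has mean 5.0 and unit SD,
    which is inferred from the table and not stated in the text.
    """
    return 100.0 * NormalDist(mu=mean, sigma=sd).cdf(standard_score)


# Legible (standard score, centile score) pairs from Table 25.
table_25 = {7.0: 97.7, 6.5: 93.3, 6.0: 84.1, 5.5: 69.1, 5.0: 50.0, 4.5: 30.9}

computed = {s: round(centile_from_standard_score(s), 1) for s in table_25}
for score, printed in table_25.items():
    print(score, printed, computed[score])
```

Under this assumption the computed centiles agree with every legible
row of the table to one decimal place.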


9. Test of Mechanical Comprehension *


The test consists of 60 pictures of mechanical situations, in each
of which a problem exists, e.g., which of 2 pairs of shears would
cut metal better; which of 2 cords is an ordinary electric light
cord; which of 2 rooms would have more echo; in which direction
is the last of a set of gears turning. There is no time limit, but
the test usually takes 20 to 25 minutes. There are two forms. Form
BB is suitable for male candidates for engineering schools,
engineering students, and comparable adult men. Form AA is for males
in high school or trade school. Form BB is about 12 points more
difficult. The test is not well suited for women. The mean score
of women in educationally comparable groups is about 14 points
lower than that of men. Answer sheets are provided for hand or
machine scoring. Split-half reliabilities of .80 and .84 are given,
with corresponding standard errors of 4.3 and 4.5. Test-retest
reliabilities given are .90 to .93, with standard error of 3.0. The
test correlates from .30 to .60 with success in engineering-type
occupations. Percentile norms are provided.
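
The reliabilities and standard errors quoted above can be tied
together with the usual standard error of measurement, SEM = SD x
sqrt(1 - r). The source does not say how its standard errors were
obtained, so applying this formula is an assumption, and the score
standard deviations it yields are implied values rather than reported
ones. A sketch:

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical SEM: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def implied_sd(sem: float, reliability: float) -> float:
    """Score SD implied by a reported SEM and reliability (same formula inverted)."""
    return sem / math.sqrt(1.0 - reliability)


# (reliability, standard error) pairs quoted in the text for the split-half figures.
reported = [(0.80, 4.3), (0.84, 4.5)]

for r, sem in reported:
    sd = implied_sd(sem, r)
    # Round-trip check: the implied SD reproduces the reported standard error.
    assert abs(standard_error_of_measurement(sd, r) - sem) < 1e-9
    print(f"r = {r:.2f}, SEM = {sem:.1f}, implied SD = {sd:.2f}")
```

If the assumption holds, the two split-half figures imply score
standard deviations of roughly 10 to 11 points.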


TESTS FOR VOCATIONAL APTITUDES 


Very large numbers of tests of various types exist for the
measurement of aptitude in numerous vocations. These tests are
of varied type. Some envisage classes or groups of occupations
such as clerical, mechanical, and the like. Others are intended to
uncover fitness for some specific vocation or job. Since these are
of minor psychometric and psychological interest, none of them
is treated here. The boundary lines, however, are indistinct, as
they always are in the whole field of testing. The two tests for
clerical workers here discussed might be considered as belonging
to either of the two categories above. Both of them, and more
particularly the Minnesota test, are included to give the reader a
concrete example of a measuring instrument based primarily upon
job analysis in contrast to broader psychological considerations.

* References: Bennett and Cruickshank, 1942, a and b; Bennett and Geat-


1. Detroit General Aptitudes Examinations *


The instrument consists of 16 subtests, each with a timing of
3 to 5 minutes. (1) Rate and quality of handwriting. (2) General
information. (3) Arithmetic. (4) Motor speed, shown by tracing
circles. (5) Knowledge of the names of tools shown in pictures.
(6) Disarranged pictures. (7) Verbal opposites. (8) Spelling
errors, i.e., misspellings to be indicated. (9) Size discrimination.
(10) Verbal analogies. (11) Checking test, for speed and accuracy,
pairs of names and numbers to be indicated same or different.
(12) Tool information. (13) Classification. (14) Tracing
mechanical relationships shown in belt-and-pulley drawings. (15)
Disarranged sentences. (16) Alphabetization.

In the scoring, each subtest is given 30 to 53 points. Scores may
be on three bases: (a) intelligence, i.e., subtests 2, 3, 4, 6, 7, 8, 10,
13, 14, 15; (b) clerical aptitude, i.e., subtests 1, 3, 4, 6, 8, 11, 13,
15, 16; (c) mechanical aptitude, i.e., subtests 1, 3, 4, 6, 9, 12, 13,
14. Thus there are 5 subtests in common as between the
intelligence and mechanical aptitude scores, 6 in common as between
the intelligence and clerical aptitude scores, and 5 in common as
between the clerical aptitude and mechanical aptitude scores.
Correlations between sets of scores are reported as being .80, .70,
and .73. Correlations with independent measures of intelligence
are as follows: with the Detroit Advanced Intelligence Test, .90
for 188 12th-grade children; with Stanford-Binet (1917) I.Q.’s,
.652 for 199 12th-grade children; clerical score with Detroit
Advanced Intelligence Test, .739 for 188 12th-grade children. There
is no demonstration that this is a true aptitude test, or that it has
special validity for clerical or mechanical occupations. Its
reliabilities for the three scores are .80, .90, .88 (retest). Age norms are
reported on about 10,000 cases. The general conclusion seems to
be that this test, in spite of its name, functions as a general
intelligence test.
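
The overlap counts quoted above follow directly from the three
subtest lists and help explain why the three scores correlate so
highly with one another. A small sketch that recomputes them (the
dictionary and function names are illustrative only; the subtest
numbers are those given in the text):

```python
from itertools import combinations

# Subtests entering each of the three scores, as listed in the text.
SCORE_SUBTESTS = {
    "intelligence": {2, 3, 4, 6, 7, 8, 10, 13, 14, 15},
    "clerical": {1, 3, 4, 6, 8, 11, 13, 15, 16},
    "mechanical": {1, 3, 4, 6, 9, 12, 13, 14},
}


def shared_subtests(scores):
    """Map each pair of score names to the sorted list of subtests they share."""
    return {
        (a, b): sorted(scores[a] & scores[b])
        for a, b in combinations(scores, 2)
    }


overlaps = shared_subtests(SCORE_SUBTESTS)
for pair, shared in overlaps.items():
    print(pair, len(shared), shared)
```

Subtests 3, 4, 6, and 13 enter all three scores, so a substantial
part of each score is literally the same material.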


* Reference: Lorge, 1941 b.


2. Detroit Clerical Aptitudes Examination *


This test, which comes in a separate form, consists of subtests
1, 3, 4, 6, 11, 13, and 16 of the above, and in addition a subtest on
commercial vocabulary.


3. Detroit Mechanical Aptitudes Examination †


The test consists of subtests 3, 4, 5, 6, 9, 12, 13, 14 from the
General Aptitudes Examination, and no others.

As to both the last-mentioned tests, the clerical score or the
separate clerical test may give some basis of understanding not
given in the intelligence items, although just what this may be is
not clear; the mechanical score and separate test are based,
however, on an entirely undefined concept and there is no clear
criterion.


4. Minnesota Vocational Test for Clerical Workers ‡


This is an excellent example of a test oriented by job analysis,
and of the general work-sample type. The content consists of pairs
of numbers and pairs of names, which are to be checked as same
or different. For example:


5794367 5794267 
79542 79542 
John C. Linder John C. Lender 
Investors’ Syndicate Investors’ Subdicate 


Parts 1 and 3 consist of number items. Parts 2 and 4 consist of
name items.

Reported reliabilities are .75 for number checking and .93 for
name checking. As to validation, male clerical workers do much
better than an unselected population, and one must exceed 9[?]%
of the general population to do as well as the average clerk.
Correlations from .60 to .70 have been reported with evaluated
personal histories. The test is found to be diagnostic of filing ability.
It is relatively independent of general intelligence, and thus an
aptitude test. In its construction many experimental tests were
tried out on clerical and general workers and on employed and
unemployed clerical workers, and it was found to be the most
sensitive instrument of differentiation discovered.
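
The checking task itself is simple enough to sketch. The four item
pairs below are the sample items printed above; the scoring rule
(number of items checked correctly) and the function names are
assumptions made for illustration, since the source does not give the
published scoring formula.

```python
def answer_key(pairs):
    """Return 'same' or 'different' for each (left, right) item pair."""
    return ["same" if left == right else "different" for left, right in pairs]


def score_checking(responses, key):
    """Hypothetical scoring rule: count of items checked correctly."""
    return sum(response == correct for response, correct in zip(responses, key))


# Sample items from the text: two number pairs, then two name pairs.
SAMPLE_ITEMS = [
    ("5794367", "5794267"),
    ("79542", "79542"),
    ("John C. Linder", "John C. Lender"),
    ("Investors' Syndicate", "Investors' Subdicate"),
]

key = answer_key(SAMPLE_ITEMS)
print(key)

# A subject who marks every pair 'same' gets only the second item right.
print(score_checking(["same"] * len(SAMPLE_ITEMS), key))
```

Only the second sample pair is identical; each of the others differs
by a single character, which is what makes the task a test of speed
and accuracy rather than of knowledge.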


* Reference: Lorge, 1941 a.
† Reference: Lorge, 1941 c.
‡ References: Andrew and Paterson; Andrew; Green and Berman; Bingham.


TESTS OF PROFESSIONAL AND ACADEMIC APTITUDE


Here we encounter further attempts to work out and define
effective and well-founded differentiating concepts, and to
translate them into instruments of measurement. Two points should
be emphasized in advance. (a) In all professional aptitude tests,
the criterion on which by far most reliance is placed is not
professional success, but success in professional studies. (b) Item
content is strikingly similar to that found in general intelligence
tests, combined with a certain degree of special reference or bias.
It probably cannot be shown, and certainly has not been, that
obtained scores are independent of or not highly correlated with
intelligence test scores. So the probability is that they are aptitude
tests only in name, and in reality intelligence tests oriented
towards some special group or interest.


1. Medical Aptitude Test *


This test is a good example of the point that has just been made.
It was administered annually from Washington under the
supervision of the Committee on Aptitude Tests for Medical Students
of the Association of American Medical Colleges. Frequent
revisions were made, and each year data on the test, from some 600
institutions using it, were collected, tabulated, and reported back
with interpretive comments.

Form 16 of the test (Moss, 1942) consists of 7 subtests as
follows. (1) Visual memory. (2) Memory for content. (3) Memory
for content. (4) Scientific vocabulary. (5) Understanding of
printed material. (6) Scientific definitions. (7) Logical reasoning.
The revisions which have been made from time to time have been
based largely on studies of the predictive power of the subtests.
Typical validation material has been presented in Table 3. This
shows the relationship of test scores to success in medical school.
Beyond this a less determinate relationship was established
between the scores and success during internship. Kandel (q.v.),
considering its item content, is no doubt quite correct in
considering it in effect a general alertness test with a strong medical slant.
Thus it does not deal with aptitude in the strict sense, and in this
respect it is typical of many similar instruments. It represents an
interesting and apparently in the main successful undertaking, and
the fact that for various reasons it has recently been discontinued
does not make it less instructive.

* References: Kandel; Moss, all entries.


2. Law Aptitude Examination * 


This is another special purpose test similar in general to the
foregoing. For various reasons, most of them practical and
administrative and having nothing to do with the intrinsic excellence
of the instrument, it has not been as widely used as the Medical
Aptitude Test. It has not gone through the same sequential
revisions, nor has extensive experience with its use accumulated and
been subjected to tabulation and analysis to the same extent.

The examination consists of 5 subtests as follows: (1) Accuracy
of recall. (2) Reading comprehension using legal material. (3)
Reasoning by analogy. (4) Reasoning by analysis. (5) Skill in
pure logic. Item content has a bias towards legal material. Once
more, then, it seems to figure as a special purpose intelligence
test rather than what might be defined as an aptitude test proper.

This, too, is the character of the law aptitude test constructed
by W. M. Adams (q.v.). It consists of 8 subtests, namely difficult
analogies, mixed relations (giving the 2 most closely related words
in groups of 6), opposites (selecting from 5 choices the word
opposite in meaning to the key word), memory (using readings of
judicial opinions), relevancy (using an involved legal case),
reading comprehension (using judicial opinions), and legal
information ... that a test of this type has closer relationship
to law school achievement ... all of which he investigated.


3. Iowa Placement Examinations †


These tests, which were ... the admissions and freshman guidance
programs at the University of Iowa, deal with specific subject
matter areas, including English, mathematics, chemistry ...
intelligence test, but contains acquired material from the subject
in question. The battery yields a profile rather than a global score.

* References: Ferson and Stoddard; Kandel.
† References: Stoddard, 1926, 1943.

... to use symbolic logic, and to interpret difficult mathematical
reading. The Foreign Language Aptitude Test consists of subtests
to reveal knowledge of English: parts of speech, inflexions and
grammatical principles, reading comprehension, and translation
from English to Esperanto.

The general effectiveness and validity of the instrument may
be judged from the data presented in Tables 26 and 27. With
regard to the correlations in Table 26, it should be noted that they
are between test performance and marks in the subject indicated,
not average marks. Thus they are approximately of the same
order, or within the range of many correlations that have been
reported between intelligence test scores and averages of measures
of college achievement, and better than most correlations between
intelligence scores and special subject achievement.


TABLE 26


CORRELATIONS OF SUBJECT-MATTER MARKS, FIRST SEMESTER FRESHMAN
YEAR, WITH IOWA PLACEMENT EXAMINATION RATINGS


(Stoddard, 1928, Table I, p. 96)


                 Series I (Aptitude)       Series II (Training)
Subject          (Testing time 90 min.)    (Testing time 80 min.)

Chemistry               .50                       .60
English                 .55                       .60
French                  .60                       .65
Mathematics             .55                       .60
Physics                 .50                       .55

Stoddard very truly points out that such coefficients are difficult
for educational authorities to interpret, and considers that they
fail to tell the story of the relationship as clearly as might be
wished. He considers that the real situation is better and more
clearly revealed in the data presented in Table 27. It is quite
evident that decile ratings on the Placement Examination are
indicative and useful, and that the quartiles are very sharply
distinct. The indication is that chances for success at the university
are forty times as great for some candidates as for others.


TABLE 27


PERCENTAGE OF STUDENTS IN DECILES AND QUARTILES OF IOWA
PLACEMENT EXAMINATIONS MAKING GRADES OF A OR B, C OR D,
AND F IN CHEMISTRY IN FIRST SEMESTER FRESHMAN YEAR
(Stoddard, 1928, Table 2, p. 97)


                          CHEMISTRY APTITUDE     CHEMISTRY TRAINING
                                SERIES                 SERIES
Test Scores                     Grades                 Grades
                           A, B   C, D    F       A, B   C, D    F

Deciles
  10 ....................   68     31    ...       70     30     0
   9 ....................   44     51     5        51     47     2
   8 ....................   40    ...    ...      ...    ...    ...
   7 ....................  ...     64     9        30     63     7
   6 ....................   26     60    14        30     60    10
   5 ....................   18     67    15       ...     52    19
   4 ....................   18     63    19        31     58    11
   3 ....................  ...     63    25        26     66     8
   2 ....................    5     65    30        19     61    20
   1 ....................    3     51    46         8     62    30

Upper quartile ..........  ...     45     4        58    ...     4
Upper middle quartile ...   28     61    11        32     60     8
Lower middle quartile ...   18     64    18        27     60    13
Lower quartile ..........    6     60    34        16     61    23


Since some rather forceful claims have been made on behalf of
this instrument as a noteworthy psychometric advance, it is of
some interest to compare its predictive efficiency with that of more
frequently used measures. At the University of Minnesota it was
found that the best predictive index for freshman achievement
consisted of a weighted score combining standing on the American
Council on Education Psychological Examination and percentile
rank in high school. The relationship of this weighted score to
college achievement is shown in Table 28. It is not possible to
make a close comparison, because the data are reported in a
different form, but at any rate the decile predictive values here and
at Iowa do not seem of an altogether different order. However, in
the Iowa report the relationship is with special subject marks,
whereas in the Minnesota report it is with average marks, which
would be expected to raise it substantially.


TABLE 28


WEIGHTED PERCENTILE RANKS IN ADMISSIONS CRITERION AND
PROBABILITY OF MARKS OF C OR BETTER IN ARTS COLLEGE,
UNIVERSITY OF MINNESOTA
(v. University of Minnesota a.)


                                          Percent Making
Weighted Percentile Rank        N         Average Grade
                                          of C or Higher

        91-100                 141             95.3
        81-90                  175             73.2
        71-80                  170             66.9
        61-70                  108             48.4
        51-60                   59             32.1
        41-50                   36             23.8
        31-40                   24             23.0
        21-30                   12             25.0
        11-20                    2             25.0
         1-10                    0              0.0
        Total                  727             53.5


Turning to more comprehensive studies of the problem such as
those of MacPhail (q.v.) and Remmers (1934 a), which summarize
data accumulated over a considerable period of time and from
many institutions, the central tendency of correlations between
intelligence scores and academic success is said to be between .40
and .45 with a range from .13 to somewhat in excess of .70, and
about two-thirds of the coefficients lying between .30 and .60.
However, as already pointed out, these are correlations with ...
evidence, Stoddard’s claim to a distinctive ... procedures would
seem justified.

... level of the Pintner General Ability Tests. ... It is to be
considered as a general intelligence test emphasizing “school
readiness.”

* Reference: Grant.
† Reference: Symonds, 1927.

... omission of the “times” sign between letters and numbers, etc.
Thus the task is not merely to perform, but rather to learn,
presumably in much the same way as the subject will have to learn
in the algebra course itself. A correlation of .71 with achievement
in algebra after one term is reported, which of course is rather
impressively high.


6. Latin Prognosis Test ‡

This is a companion test to the former. It consists of a series
of very simple short lessons in Latin, e.g., on derivatives, singular
and plural forms, gender, case. A test is placed at the end of each
short lesson, to see how well the subject has learned it. A
correlation of .80 with Latin achievement is reported, which is about as
high as it can possibly be, considering that all the measures
concerned are to some extent unreliable.
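
The remark that .80 is about as high as the correlation can be rests
on the classical attenuation relation: an observed correlation between
two fallible measures cannot exceed the square root of the product of
their reliabilities. The source gives no reliabilities for the
prognosis test or the Latin criterion, so the values used below are
purely hypothetical and the function names are illustrative.

```python
import math


def max_observable_correlation(reliability_x: float, reliability_y: float) -> float:
    """Upper bound on an observed correlation between two fallible measures."""
    return math.sqrt(reliability_x * reliability_y)


def corrected_for_attenuation(r_xy: float, reliability_x: float,
                              reliability_y: float) -> float:
    """Estimated correlation between true scores (classical correction)."""
    return r_xy / math.sqrt(reliability_x * reliability_y)


# Hypothetical reliabilities, chosen only to illustrate the argument.
test_reliability = 0.85
criterion_reliability = 0.80

ceiling = max_observable_correlation(test_reliability, criterion_reliability)
print(f"ceiling on the observed correlation: {ceiling:.3f}")
print(f"reported .80 corrected for attenuation: "
      f"{corrected_for_attenuation(0.80, test_reliability, criterion_reliability):.3f}")
```

With reliabilities in this neighborhood the ceiling lies only a little
above .80, which is the sense of the statement in the text.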

Both these tests are interesting as embodying an important and 
very practical principle. This is the common-sense idea that a 
sample of a given function will provide excellent material for 
testing the probable efficiency of the function itself, and it is borne 
out by a number of investigations which indicate that initial per- 
formance on a learning task is closely related to final status. The 
high correlations with the criteria in each case are significant when 


this point of view is borne in mind. ’ i 
It is interesting to compare the last test discussed with a very
different approach to the problem of Latin prognosis. Allen (q.v.),
after an elaborate experimental and selective process, developed
a battery of six tests intended to predict success in Latin. They
were the Briggs Analogies Test Alpha and Beta, the Thorndike
Test of Word Knowledge A and B, and the Rogers Interpolation
Test 1 and 2. As a criterion he used a battery of eleven Latin
achievement tests at the end of the first [...]. [...] multiple
correlation between battery and [...]. Clem (q.v.), in another study
continuing this work, used the same criterion, and employed a
number of additional factors, such as age, elementary school
average, teacher ratings, high school marks in English and
mathematics, etc. It was possible by using this large array of
predictive indices in the best obtainable combination to obtain a
correlation of .84 with the criterion.
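A multiple correlation of the kind reported in these studies is the correlation between the criterion and the best weighted combination of the predictors. For the simplest case of two predictors it can be computed directly from the three pairwise correlations. The sketch below assumes that two-predictor case and uses hypothetical coefficients; none of the figures come from Allen or Clem.

```python
import math


def multiple_r_two_predictors(r1: float, r2: float, r12: float) -> float:
    """Multiple correlation of a criterion with two predictors.

    r1, r2 : correlation of each predictor with the criterion
    r12    : correlation between the two predictors
    """
    if abs(r12) >= 1:
        raise ValueError("predictors must not be perfectly correlated")
    r_squared = (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)
    # Inconsistent input correlations can push r_squared outside 0..1.
    if not 0 <= r_squared <= 1:
        raise ValueError("the three correlations are not mutually consistent")
    return math.sqrt(r_squared)


# Hypothetical values for illustration only.
R = multiple_r_two_predictors(0.60, 0.50, 0.30)
print(f"multiple R = {R:.3f}")
```

The combined coefficient is never lower than the better single predictor, but the gain shrinks as the predictors overlap with one another, which is one reason a long battery may add little to a short test.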
§ Reference: Symonds, 1927.

The contrast is very striking. An elaborate array of indices,
which require much time and labor to amass, can yield at the best a
correlation only slightly superior to the short and simple Latin
Prognosis Test, against what is almost certainly a more reliable
criterion.
It is a powerful argument for the special purpose test, tied closely 
to the function to be measured, not only in Latin but elsewhere. 
Of course, neither the Latin Prognosis Test nor the two extensive 
batteries used in the studies just reported can be considered instru- 
ments for the measurement of aptitude in the strict sense of the 
term. That is, they do not correlate high with the criterion and 
low with most other factors, in particular general intelligence. One 
of the difficulties encountered by Allen and Clem was that their 
batteries predicted many other school subjects practically as well 
as and in some cases better than Latin. The same is probably true,
though to a less extent, with the Latin Prognosis Test itself, 
although so far as the present writer knows, the point has not 
been investigated. Once again we seem to have a special purpose 
intelligence test (since achievement in Latin is pretty certainly 
“saturated” quite heavily with general intelligence), which may 
be considered an aptitude test if we are willing to extend the
ordinary strict meaning of the term, which indeed seems legitimate
enough.


TALENT TESTS 


[...] or monolithic mental entity, and virtually as a faculty,
although the word is scrupulously [...]. However, we know almost
nothing about [...]al talent. Also, our ideas in [...]
aptitude problem under a slightly different name. Thus to drag
in a terminological distinction which may correspond to reality
but which may not, and in any case makes no real difference to
actual procedure, seems gratuitous. For here as always, the
psychometric problem is to set up and delimit an effective working
concept that will yield a good instrument of measurement related
to the contemplated criterion.


1. Measures of Musical Talent (Seashore) *

This celebrated battery of music talent tests has undergone
several revisions and improvements since its first appearance. It
now consists of 6 subtests. (1) A test of pitch discrimination,
consisting of pairs of tones, the second of which differs in pitch
from the first, beginning with readily discernible differences and
passing on to very fine ones. The task is to decide whether the
second tone is lower or higher than the first. (2) A test of loudness
discrimination, consisting of pairs of clicks, the second of which
differs from the first in intensity, ranging again from readily
discernible to very fine differences. (3) A test of time
discrimination, consisting of pairs of time intervals marked off by
three clicks, the second differing from the first in duration. (4) A
test of timbre or tone quality, consisting of pairs of tones, the
second of which differs in timbre or quality from the first. (5) A
test of rhythm discrimination, consisting of pairs of rhythmic
patterns presented in clicks or taps, the second of which is either
the same as or different from the first. The items increase in
complexity, i.e., length, as the test progresses. (6) A test of tonal
memory, consisting of pairs of tonal patterns intended to be devoid
of melodic significance, the second differing from the first by the
alteration of one of the tones. The task is to indicate the altered
element by number.

The most important improvements which have been made in
the battery are as follows. (a) The quality of the recordings by
means of which the tests are presented has been made better as
this became possible with improved technology. Electric recordings
have been substituted for the [...]. (b) The consonance test
which was formerly included, but which proved unsatisfactory, has
been dropped, and the timbre test substituted. The construction of
this latter subtest has been made possible by the development of
an ingenious sound source capable of modifying some of the partials
of a compound clang without changing the others. (c) The battery
has been reorganized. In the past each subtest required the playing
of two record faces, which was time-consuming and involved the
inclusion of many non-discriminating items. It now consists of two
series, A and B, including all 6 subtests, each series requiring
only one record face for each of the 6 subtests. This has been
accomplished without loss of reliability. Series A is the easier of
the two, and is intended for “dragnet” purposes for use with
heterogeneous groups. Series B is recommended for use with
specialists in music and for appropriate laboratory purposes.

* References: Seashore, 1919; Saetveit, Lewis, and Seashore;
Mursell; Farnesworth, 1931.

Tables of percentile norms are presented for different ages. The
test is not recommended for children [...]. Seashore has repeatedly
insisted that [...] not represent the true meaning [...]. Profiles
on the 6 subtests should [...] judgments are made should be [...]
validation, as always, is paramount. One may debate the theory, but
one cannot [...] how and whether it actually works. [...]
discrimination, rhythmic discrimination, and tonal memory, as
embodied in the instrument, indeed significant and revealing
components, at least in part, of musical talent?

Various attempts have been made to validate the measures against
general criteria such as teacher ratings on musicality, musical
achievement, and the like. None of the obtained correlations are
high, and most of them are low (Farnesworth, 1931; Mursell).
Stanton has reported (q.v.) that the test predicts success in the
Eastman School of Music reasonably well; but in her work she teamed
it with the Iowa Comprehension Test, and she does not present her
data in such a way that one can separate the predictive efficiency
of the two instruments. Possibly the intelligence test alone would
work as well.
Seashore, however, has protested against such over-all validation
against general criteria. In his view the pitch test, for instance,
should be valid not for over-all musicality, but for certain highly
specialized functions, such as a violinist’s ability to make very
fine shadings and gradients of tone. Indeed each one of the subtests
should be validated against a different pattern of fine and
special musical functions. This, of course, would convert validation
into an intricate laboratory problem, to which in itself no
reasonable exception can be taken. But it means first that the
instrument has not in fact been validated, and second that its use
for general classification and selection, e.g., for securing members
of a high school orchestra, becomes very doubtful.

In summary, it seems correct to say that the test deals
fundamentally with various functions of auditory acuity and
discrimination. Whether such functions are true components of
musical talent, which presumably turns on higher mental
integrations, may well be questioned. If they are not, then the
battery still has a negative value, analogous to that of an
oculist’s color chart in relation to artistic talent. It would
successfully reveal those who do not hear well enough to function
musically with success. But it would not reveal positive components
of musical talent.


2. Interval Discrimination Test *

This is another music talent test constructed in terms of an
hypothetical premise. The presupposition is that the ability to
make fine discriminations of intervallic quality (e.g., between
thirds and sixths sounded simultaneously as chords), is an index of
capacity for musical behavior.

The test is a series of items each consisting of a short set of
intervals sounded as bi-chords (e.g., C-E, C-G, etc.). In each item,
one of the intervals sounded is different from the rest included, the
task being to identify it. The discriminations run from very easy
to very difficult. It is too early to offer a comprehensive report, as
the test is still in an experimental stage. However, reported
reliabilities are high, and some very high correlations with over-all
criteria of musicality have been found. The essential difference
between this test and the Seashore Measures is that it consists of
actual musical material (i.e., sounds that actually occur in music)
whereas Seashore explicitly avoided such content, and that it
requires what is probably a higher level of perceptual
discriminations.

* Reference: Madison.


3. Musical Memory Test (Drake)* 


The test consists of a series of pairs of melodic items, to be 
played on the piano or other suitable instrument. The second mem- 
ber of each pair is different from the first either in key, or time, or 
notes. The task is to indicate in which of the three respects the 
difference lies. A reliability of .93 has been reported. Percentile
norms for ages 7 to 23 are given in the instruction manual. 

Drake does not consider this to be a test of musical achievement, 
but of musical talent, of which he believes memory or melodic re- 
tentiveness to be an important indication. There are considerable 
general arguments in favor of his position. Memory items have 
been widely used in tests of intelligence. Feats of memorization 
and retention recur in the biographies of great musicians. And 
Kate Gordon (q.v.) has shown, although with too few subjects to 
make her conclusions entirely convincing, that there are enor- 
mous differences in the memory performance of those who use 
“musical” and “unmusical” methods. Drake has been able to
report fairly high correlations between scores on his test and
global criteria of musicality.


4. Tests in Fundamental Abilities of Visual Arts (Lewerenz)†

The test consists of three parts, each containing various subtests.
Part I. Preferences for proportion as shown in different designs,
the task being to choose between 4 variations of one theme;
originality of line, the task being to draw lines between dots
printed on the test blank so as to make a picture, 10 items in all.
Part II. Indication of omissions of shadows in 10 drawings; a
50-item vocabulary test dealing with art materials and processes,
drawing terms, and elements in pictures; an immediate memory
test, the task being to reproduce part of a picture of a vase from
memory. Part III. Three tests calling for an indication of errors
in pictures showing cylindrical, parallel, and angular perspective;
a color matching test, with 6 key colors to be matched with 46
variations in hue and shade.

The test was constructed in connection with work in art education
in the schools of Los Angeles and has been checked for validity
against classroom data there. It correlates .40 with marks in art
classes. A reliability of .87 has been reported for 100 pupils in
grades 3 to 9. Decidedly the most interesting and seemingly
pertinent of the subtests is that requiring the subject to draw lines
making a picture between patterns of dots. This calls for
imagination and initiative and is often thought to have some
projective significance. Next to this in excellence is probably the
first subtest, calling for judgments of proportion. The others are of
doubtful relevance to art talent.

* References: Drake, 1933 a, b.
† References: Lewerenz; Milton H. Bird.


5. McAdory Art Test *

The test consists of 72 plates, each with 4 variations of a picture.
The plates are not large enough for use with a group of considerable
size. The materials are drawn from current art and trade magazines
and include objects of common use, costume items, textile
designs, pictures, etc. In the 4 variants there are changes from the
original in proportion, intensity, and color. The subject records his
preference in each of the 72 cases. In the scoring, 1 point credit is
given for each agreement with the key, which represents the
judgments of [...] experts, including artists, architects, art
teachers, [...]. The plates included in the test were those on
which there was agreement by at least 64% of the judges. A total
score for the whole test can be computed. Also there are totals for
each of its 6 divisions (furniture and utensils, texture and
clothing, architecture and related arts, shape and line arrangements,
massing of light and dark, color schemes).

The test is competently constructed and standardized. Grade
norms are made available. Its validity, however, is dubious. Much
of the pictorial material is now outdated and has a queer
appearance, notably the costume pictures.

* References: McAdory; Milton H. Bird.
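The McAdory procedure, in which plates are retained only where at least 64% of the judges agree and the subject earns 1 point for each agreement with the resulting key, can be sketched as follows. The sketch assumes that each judge and each subject names a single preferred variant per plate, which simplifies whatever response form the test actually uses; the plate labels and votes are hypothetical.

```python
from collections import Counter

AGREEMENT_THRESHOLD = 0.64  # from the text: at least 64% of the judges


def build_key(judge_choices):
    """Keep plates on which enough judges agree; return {plate: keyed variant}."""
    key = {}
    for plate, choices in judge_choices.items():
        variant, votes = Counter(choices).most_common(1)[0]
        if votes / len(choices) >= AGREEMENT_THRESHOLD:
            key[plate] = variant
    return key


def score(responses, key):
    """One point for each keyed plate on which the subject agrees with the key."""
    return sum(1 for plate, variant in key.items() if responses.get(plate) == variant)


# Hypothetical judge votes on three candidate plates (variants A to D).
judges = {
    "plate_1": list("AAAAAAABBC"),  # 70% agree on A: retained
    "plate_2": list("AABBBCCDDD"),  # no variant reaches 64%: dropped
    "plate_3": list("DDDDDDDDDD"),  # unanimous: retained
}
key = build_key(judges)
subject = {"plate_1": "A", "plate_2": "B", "plate_3": "C"}
print(key, score(subject, key))
```

Subtotals for the 6 divisions would follow the same rule, applied to the plates of each division separately.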
6. Meier Art Test. I. Art Judgment *


The test consists of 100 paired comparisons of pictures, the 
pictures of each pair differing in one respect, e.g., position of the 
moon, size of topsail, etc. The difference is specified, and the sub- 
ject is required to decide which member of the pair is better, i.e., 
more pleasing, more artistic, more satisfying. 

It is a revision of the Meier-Seashore Art Judgment Test, which 
had 125 items of similar kind, reduced from an original 300. 
Criteria for item selection in the Meier-Seashore Test were: (a) 
reputability of the art work, (b) exemplification of some aesthetic 
principle, (c) suitability for testing. Each item was submitted to 
25 art experts, and the resulting experimental form of the test 
was given to 1081 individuals. Final selection of items was made 
on (a) favorable reaction of the experts, (b) 60 to 90% preference 
by the subjects. The present test was derived from the previous 
one by an analysis of the prognostic value and relative consistency 
of the items, using biserial correlation. Of the previous 125 items, 
the 25 worst were eliminated, and the 25 best given an additional 
point in a new weighted scoring system. In comparison with the 
previous test, a wider distribution of scores was obtained, which 
“suggested an enhanced validity for the new form of the test” 
(Manual, p. 14). There is also cited “additional evidence of 
validity,” among which the following points are important.
Correlations with intelligence are negligible, running from -.14 to
.28. Adults of superior intelligence do not rate as high as members
of art faculties. Art experts obtain high scores. The fact that a
junior high school pupil (age 12) without art training may score as
well as a trained adult is thought to suggest that the test measures
innate ability. Reliabilities of .70 to .84 are reported, without
specifying the type of coefficient. The test is designed to indicate
probable art talent in the general population. Percentile norms are
presented, based on more than 3000 junior and senior high school
pupils studying art. The following interpretations are suggested.
First quartile (100-76): other things being equal, almost certain
to succeed in an art career, especially if the individual [...] craft
skill in his ancestry, good intelligence, [...].
Second quartile (75-51): high average art judgment, which, if
it is associated with other favorable traits, makes it possible to
expect much. Third quartile (50-26): still more compensating
factors would be needed if success were to be anticipated. Fourth
quartile (25-1): should take other tests, and inquire further. If
the rating were corroborated, advice would be against attempting
an art career.

* References: Meier, 1926, 1942.
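Two mechanical features of the Meier test lend themselves to a brief illustration: the weighted scoring, in which the 25 best of the 100 retained items carry an additional point, and the assignment of a subject to one of the four interpretive bands. The sketch below reads the ranges 100-76, 75-51, 50-26, and 25-1 as percentile ranks, which is an assumption, since the text does not say so outright. The item numbers and the choice of bonus items are hypothetical.

```python
# Bands as listed in the text, read here as percentile ranks (an assumption).
BANDS = [
    (76, 100, "first quartile"),
    (51, 75, "second quartile"),
    (26, 50, "third quartile"),
    (1, 25, "fourth quartile"),
]


def meier_band(percentile_rank):
    """Return the interpretive band for a whole-number percentile rank."""
    for low, high, label in BANDS:
        if low <= percentile_rank <= high:
            return label
    raise ValueError("percentile rank must be a whole number from 1 to 100")


def weighted_score(correct_items, bonus_items):
    """One point per correct item, plus one extra for each correct bonus item."""
    correct = set(correct_items)
    return len(correct) + len(correct & set(bonus_items))


# Hypothetical numbering: items 0-99, with items 0-24 taken as the 25 "best".
bonus = range(25)
print(weighted_score(range(100), bonus))  # a perfect paper
print(meier_band(80))
```

On this reading a perfect paper earns 125 points from 100 items, which is consistent with the wider distribution of scores reported for the revised form.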

As a footnote to the six tests so far discussed, it may be
remarked that the word talent tends quite strongly to be confined to
artistic, musical, and perhaps literary pursuits, although the
limitation is not absolute. Is there any reason for speaking of
mechanical “aptitude” and of artistic “talent,” and for confining
the term to these connections? Obviously not. Clearly we are dealing
with nothing deeper than common usage, for there is no evidence for
supposing or reason to believe that “aptitudes” and “talents” are
in any way really different psychological functions.

7. Stanford Scientific Aptitude Test (Zyve)*

Zyve has undertaken to analyze “scientific aptitude” into the
following components. (1) The ability to make and recognize clear
definitions. (2) The tendency to suspend judgment when evidence
is insufficient, as contrasted to making snap judgments. (3) An
experimental bent. (4) Power of discriminating values in the
selection and arrangement of data. (5) Power to detect fallacies and
contradictions. (6) Reasoning. (7) The accumulation of systematic
observations. (8) Induction, deduction, generalization. (9) Accuracy
of understanding and interpretation. (10) Caution. These
functions constitute the 10 subtests. Correlations of .95, .77, and
.89 are reported with “competent judgments on the abilities of
50 research students in science.” No other validation is offered,
and this material is not adequately analyzed. Crawford (q.v.) has
reported correlations of only .30 between the test scores and
subsequent marks in science for 143 Yale freshmen, and a reliability
of only .60 for the whole test. Benton and Perry (q.v.) report
correlations of .30 to .37 between scores on the Zyve test, and
grades in science over four college years for 43 students. Marshall
(q.v.) reports correlations with grades in science courses in
college as follows: with freshman-sophomore science grades,
.404 ± .09 (N = 47); with junior and senior science grades,
.345 ± .09 (N = 43); with average chemistry grades, .369 ± .085
(N = 47); with average physics grades, .423 ± .08 (N = 46);
with average biology grades, .523 ± .07 (N = 46). It is extremely
probable that the data reported in the last three studies represent
the predictive value of the test more accurately than Zyve’s
reported correlations with instructor ratings on scientific ability.

* References: Zyve; Bingham; Crawford, 1941 b; Benton and Perry.
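The ± figures attached to the coefficients quoted from Marshall look like probable errors, the index of sampling error customary in reports of this kind. That reading is an assumption, since the statistic is not named in the text. On that assumption the conventional formula, 0.6745 (1 - r²) / √N, can be sketched as follows.

```python
import math

PROBABLE_ERROR_FACTOR = 0.6745  # half of a normal distribution lies within ±0.6745 SD


def probable_error_of_r(r: float, n: int) -> float:
    """Conventional probable error of a correlation coefficient."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return PROBABLE_ERROR_FACTOR * (1 - r ** 2) / math.sqrt(n)


# Coefficients and sample sizes as quoted in the text.
for label, r, n in [
    ("junior and senior science", 0.345, 43),
    ("average chemistry", 0.369, 47),
    ("average physics", 0.423, 46),
    ("average biology", 0.523, 46),
]:
    print(f"{label}: r = {r:.3f}, probable error = {probable_error_of_r(r, n):.3f}")
```

For these four coefficients the formula gives values of about .09, .085, .08, and .07, in agreement with the figures quoted. With fewer than fifty cases each coefficient is therefore uncertain by something approaching a tenth in either direction.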


CONCLUSIONS 


There are of course an enormous number of instruments under 
the general classification of aptitude tests as here understood. But 
the samples that have been discussed are representative of the 
methods used in construction, the derivation of interpretive norms, 
and validation; and it may be said with some confidence that 
there are none decisively better. Thus the present writer believes 
that a not seriously misleading picture of what has been accom- 
plished is presented in spite of the necessary narrowness of the 
selection. 

It would seem that those tests are most superior in the essential 
characteristic of validity in which the central concept is most 
directly derived from the function to be measured. Such are in 
the first instance the Latin Prognosis Test and the Algebra Prog- 
nosis Test, with the mechanical assembly tests and motility or dex- 
terity tests displaying a less certain relationship to their criteria. 
But in each of such cases, the frame of reference and the basic 
concept are rigidly limited. When it comes to instruments like the 
Medical Aptitude Test or the Iowa Placement Examination, which 
are essentially general mental tests with a special slant and pur- 
pose, there seems to be a superior validity for that particular pur- 
pose to the general intelligence test. When it comes to tests con- 
structed about some broad psychological analysis of the function, 
like the Seashore or the Zyve, they are interesting and noteworthy 
in direct proportion to the debatability and seriousness of the 
general position itself. But if they possess practicable validity, this
has not been satisfactorily shown. It may be doubted very much
whether effective aptitude tests can be constructed in terms of such
a logic, partly because psychological science has not advanced far
enough, and partly because there is great psychological overlapping
in functional operations. To enjoy or produce music, to paint
pictures, to learn mathematics, to handle groups of human beings,
and so forth, are activities which, in all probability, have a great
deal in common, psychologically speaking. Of course this
commonality may be supposed to be greater or less in specific cases.
But we never know quite how great it is, and all boundary lines
are hazy. Thus psychological theory and general analysis are
probably incompetent to define and specify working concepts
about which valid tests can be built. The best instruments derive
their validity from concepts empirically defined in terms of the
job or function itself.

Thus it has to be recognized that the whole immense amount of 
work that has been done in developing tests of what are variously 
called aptitudes, special abilities, or talents has contributed 
nothing of signal importance towards the establishment of authen- 
tic psychodiagnostic instruments. Profiles of one sort or another 
can certainly be secured by administering batteries of tests of 
varied kinds, or instruments with a variety of subtests such as the 
Detroit General Aptitudes Examination. But such profiles are no
more than summations of differential scoring on the tests
themselves, and cannot be taken as representing a valid picture of
mental organization itself. The reason is that the most successful
of such tests are built about concepts defined in terms of the
functions to be measured, the nature itself of which is
psychologically undetermined. Because of their very nature and
logic, the measuring instruments cannot penetrate below this level.
And if we turn to tests projected on broad psychological
considerations, the trouble then is that their relation to any
recognizable practical function is in doubt.
SUGGESTED ADDITIONAL READINGS

For additional reading and more intensive study of the material in
this chapter the most important sources are the tests and more
particularly the manuals of the tests discussed. Publishers will be
found listed in the bibliography of tests at the end of the book.
Also the references mentioned in the text in connection with the
various tests may be consulted. Further readings and suggestions are
as follows:

Walter V. Bingham, Aptitudes and aptitude testing (New York:
Harper and Brothers, 1937), Chapters 2 and 3, “The theory of
aptitude”; Chapter 16, “Selection of tests.” A broad treatment of the
[...] in the first reference, and of numerous general [...] of
aptitude testing in the [...].

Clark L. Hull, Aptitude testing (Yonkers-on-Hudson, N. Y.: World
Book Co., 1928), Chapter 6, “The basic theory of aptitudes and
tests.” Discusses a number of basic theoretical issues.

Frank N. Freeman, Mental tests: their history, principles, and
applications (Rev. ed.; Boston: Houghton Mifflin Company, 1939),
Chapter 7, “Tests for the analysis of mental capacity.” Aptitude-type
tests discussed from an interesting interpretive viewpoint.

Donald G. Paterson, Richard M. Elliott, L. Dewey Anderson, Edna
Heidbreder, Minnesota Mechanical Ability Tests (Minneapolis:
University of Minnesota Press, 1930), Chapter 16, “Summary and
significance of the results.” A summary and interpretation of their
elaborate studies on mechanical aptitude tests.

Oscar Krisen Buros (editor), The 1938 mental measurements year-
book (New Brunswick, N. J.: Rutgers University Press, 1938); also 
The 1940 mental measurements yearbook (Highland Park, N. J.: The 
Mental Measurements Yearbook, 1941). These valuable reference 
works should be consulted for test reviews and bibliographies. 

Gertrude H. Hildreth, A bibliography of mental tests and rating
scales (2nd ed.; New York: Psychological Corporation, 1939); also
Bibliography of mental tests and rating scales. 1945 supplement
(New York: Psychological Corporation, 1945). These very complete
bibliographies are most useful sources for any additional tests that
may be desired.


QUESTIONS FOR DISCUSSION 


1. What would be some of the practical advantages if a measurable
function or process of general dexterity or motility had been
discovered?

2. What subtests of the Detroit General Aptitudes Examination
resemble subtests in intelligence tests that have been discussed?
Particularize.

3. Of the tests discussed in this chapter, which are aptitude tests
according to the definitions of aptitude which have been quoted?
Give reasons.

4. To what extent might the Latin Prognosis Test and the Algebra
Prognosis Test measure general intelligence? Why?

5. Can you make any suggestions for material for tests similar to
the two just mentioned for chemistry, English composition, manual
training, typewriting? Would such prognosis tests tend to measure
general intelligence?

6. Could we say that the Seashore Measures are based on a
psychological theory of musicality, while the Interval test is based
on a partial [...] of the musician’s job?

7. Would the orientation of general [...] school work make them in
effect “special [...]” [...] those discussed? Where would the
differences, if any, lie?

8. Would an individual profile based on a battery of intelligence
tests and [...] tests be anything more [...] “global scores”? Would
the criticism of “global [...]

9. Might there be any difference in psychological significance
between such a profile and that yielded by the Chicago Tests of
Primary Mental Abilities? Just what, if any, would the difference be?


CHAPTER VIII 

TESTS OF PERSONALITY, INTEREST, ATTITUDE, 
AND CHARACTER 

THE AREA: ITS DELIMITATION AND CHARACTERISTICS

A great many instruments for measurement and appraisal belong
in the broad area roughly indicated by the title of this
chapter. They range all the way from techniques for the observation
of subjects in a normal life setting to psychometric tests of
more or less conventional form and construction. Projective
instruments also have been treated in connection with it, but since
their whole theory and approach is distinctive, they are dealt with
elsewhere in this book. Three comments need to be made in advance in
regard to the characteristics of the area under consideration before
passing on to consider typical instruments and techniques.

1. The terminology is vague. Indeed, rigorous attempts to establish
precise and sharply defined distinctions probably create
difficulties rather than removing them. Thus the reader should be
warned in advance that words here cannot be said to have
universally accepted meanings, and that they are often not used in
exactly the same sense by all writers, or even by the same writer
in different places. Thus E. B. Greene (q.v.), who is unusually
careful to define his terms, speaks of the whole field as having to
do with what he calls “modes of adjustment,” by which he means
“ways in which a person approaches a goal.” This certainly covers
a sufficiently wide extent of territory! But it is not inappropriate.
Again, the word trait is frequently employed, and it is roughly
defined as a fairly consistent and specific mode of behavior, which
once again would make boundary lines none too easy to draw.
Temperament, once again, is taken to mean a group of more or
less similar traits, or a pervasive and inclusive trend of the
personality. It is possible to find at least two different
connotations of the term attitude. An attitude may be thought of and
dealt with as a tendency to react and feel in a certain way about
some specific issue or problem, such as free speech, communism, or
war. Or it may be used to stand for a generalized tendency to
approach life intellectually, aesthetically, or in terms of religion,
in which case it may be equivalent to the word value. So one might
go on. The
point is not to be confused by mere terminology, or to strive fruit- 
lessly for clean-cut, mutually exclusive classifications, or to waste 
time and energy in seeking sharp definitions and distinctions which 
the phenomena themselves render impossible, at least in our pres- 
ent stage of understanding. 

The investigations of Raymond Cattell (q.v.) constitute what
is undoubtedly the most elaborate and far-reaching attempt now 
being made to explore and clarify the ideas and concepts of this 
field. Insisting on the importance of distinguishing between trait 
modalities (1946 a), he presents operational definitions according 
to which dynamic traits are those which respond to changes in 
incentives, abilities respond to alterations in the complexity of 
the path to a goal, and temperament is of all trait-types the least 
responsive to field changes. Cattell’s general position is that many
so-called mental abilities are resultants of, or at least very closely
associated with, personality factors of an apparently different
kind. He suggests that verbal ability, for instance, may be
associated with lack of sociability, which results in preference for
books over people, and that mathematical ability is associated with
lack of dominance. By means of an elaborate factorial analysis,
he arrives at a list of 12 basic or primary personality
characteristics or traits. It must be confessed that at the present
time this work remains remote from the practical uses of
psychometrics, and that for the most part it is no more than a
promise, or perhaps a hope, of clearer working concepts to come.
However, Cattell demonstrates that a personality trait recognizable
in adult life can be [...] laboratory situation, also in
recognizable form, which provides [...] test construction
(1941, [...]).

2. This vagueness should not blind us to the importance
of the field, or to the great value of [...] instruments
of measurement and evaluation for dealing with it, if such
can be found. As a matter of fact, good instruments of this type do
[...] by a number of investigations.

Thus Farmer (q.v.) studied the reactions of 259 boys on a cube
construction test, and was able to differentiate several types. (a)
First, there was the completely controlled type, showing complete
mastery of the situation, not thinking ahead, and not worrying.
(b) Second, there were the “thinkers,” who tended to think ahead
in abstract terms and symbols including language and to plan
before they acted. (c) Third, there were the “good workers,”
conscientious, plodding, going ahead even when baffled. (d) Fourth,
there were the “fools,” inept, never knowing what to do next. (e)
The fifth was a miscellaneous category, containing those with no
very definite characteristics in action. After three years of
training, the industrial proficiency of these boys was measured by a
practical examination, and a decided difference in favor of the
first three types as contrasted with the last two was revealed.
This, of course, is not a conclusive study, but it is suggestive. If
instruments for the appraisal of personality and temperament
can be shown to have prognostic efficiency after the lapse of three
years, it is clear first that the tests themselves are useful, and
second that the factors revealed are of major importance.

Once again the Pannenborgs (q.v.), in a notable study, found a
remarkable consistency in traits and general temperament among
persons who are musically talented and musically active. They
made a study of 3,860 children of whom 494 were known to be
musical, 423 professional adult musicians, and the biographies of
21 eminent composers, and report a striking trait similarity among
all three groups. The groups of musically talented and musically
active persons are decidedly above the average in physical activity,
not very industrious, highly emotional in their reactions to the
circumstances of life, intellectually versatile with high literary and
artistic interests, imaginative sometimes to a pathological degree,
not orderly, punctual, or “scientific,” endowed with strong vital
needs, fond of eating and sexual expression, interested in the
opposite sex, physically healthy, but often nervously unstable. The
general purport of this work is that what is ordinarily called
musical talent is by no means an isolated and special quasi-faculty,
but a general and pervasive setting of the entire personality.

But perhaps no research investigations are really needed to
demonstrate the paramount importance of the area here under
consideration.

3. As typical instances of the psychometric work in this area
come up for discussion, it will become evident once again, and
perhaps with special force and clarity, that everything depends on
the isolation of the proper working concept if the resulting
instrument is to have any value. To isolate and define a concept that
will lead to a clearer understanding and a better control of mentality
or behavior (whichever term one prefers), and then to translate it
into viable items, is the logical pattern of test construction here as
elsewhere. But in some respects the process is thrown into sharper
relief, and its essential nature more manifestly revealed, in
connection with the measurement and appraisal of personality,
interest, attitude, and character than in any other connection.


MEASURES OF PERSONALITY 


Here workers in psychometrics have found themselves coping
with a most inclusive concept. The definition put forward by
Allport (1927) is probably as good as any. Personality is said to
mean “the individual’s characteristic reactions to social situations,
and his adaptation to the social features of his environment.”
Allport has been able to find within the meaning so indicated certain
psychometric leads. He considers the prime factors in personality
to be first, intelligence or general adaptability; second, motility or
speed of reaction; third, temperament or the individual’s prevailing
emotional reactions or moods; fourth, sociality or tendency to
social participation; fifth, the individual’s manner of solving social
problems. All these obviously overlap. Measures which are
considered to have to do with personality emphasize the last three.

The instruments to be considered here, and which are
representative of a very large number of similar examples, may be
classified and interpreted in terms of their use of one of three
prevailing modes of technical treatment and approach. (a) The first of
these is the use of self-rating items centering about concepts
empirically isolated and defined. (b) The second is the use of
self-rating or self-revealing items centering about concepts defined by
systematic psychiatry. (c) The third mode of technical approach
is by ratings made by someone other than the subject himself. Of
the tests to be discussed below, numbers 1, 2, and 3 belong to the
first category, numbers 4, 5, and 6 to the second, numbers 7 and 8
to the third, while number 9 is of a special type.


1. The Adjustment Inventory (Bell)* 


This is a self-questionnaire. It consists of 160 questions, such as
the following: Do you daydream frequently? Did you ever have
a strong desire to run away from home? Do you take cold rather
easily from other people? Do you enjoy social gatherings? Are you
often sorry for yourself? According to the instructions, these
questions are to be answered “frankly and honestly.”

* Reference: Bell.

The scoring of the answers is intended to reveal the subject’s
status with reference to 5 aspects of adjustment. (1) Home
adjustment, i.e., whether he is satisfied with his home life and
associations. (2) Health adjustment, i.e., whether he has been ill
much, has had operations, suffers from minor ailments. (3) Social
adjustment, i.e., whether he is shy, retiring, submissive. (4)
Emotional adjustment, i.e., whether he is easily disturbed, nervous,
depressed. (5) Occupational adjustment, i.e., whether he is satisfied
with his job, its associations, conditions, etc. The endeavor to
insure that this would be a valid instrument turned on item selection.
Items were chosen on clinical experience, on their correlation with
similar items in other such inventories, and on their power to
discriminate between well-adjusted and ill-adjusted persons. A
reliability of .94 is reported for the whole instrument.

The inventory is internally consistent, but it is clear that the
categories of adjustment, so called, are entirely empirical. Also it
is doubtful whether self-questioning of this direct and obvious
kind can yield authentic insights, and of course there is the
question whether the five types of adjustment are really distinct,
fundamental, and meaningful. Moreover it is also open to much doubt
whether the questions can be answered yes or no, as the inventory
requires. It is, however, competently set up, and within the very
grave limitations indicated, probably about as good as such
instruments, of which there are many, can very well be.


2. California Test of Personality * 
This is another typical self-questionnaire. It calls for yes-or-no
responses. The total score is divided into two major parts: first,
self-adjustment, which includes self-reliance, sense of personal
worth, sense of personal freedom, feeling of belonging, freedom from
withdrawing tendencies, and freedom from nervous symptoms; second,
social adjustment, which includes social standards, social skills,
freedom from antisocial tendencies, family relations, school
relations, and community relations. The instrument yields a total
score, a self-adjustment score, and a social adjustment score, and
also a profile indicating status on the various adjustment factors.
Although factorial studies were used in the construction and revision
of the test, they failed to yield usable and fruitful outcomes, and
the indicated components are based on what the authors speak of as
“logical analysis,” as well as experience, the judgment of field
workers, and considerable statistical research. In framing the items,
the attempt was made to encourage frank replies by rationalizing for
the pupil. For instance, a question would not be phrased: “Do you
play truant?” but rather, “Are things frequently so bad at school
that you just naturally stay away?” Split-half reliabilities are
reported on 558 cases: for total score, .931, with a standard
deviation of the distribution of 19.9; for self-adjustment score,
.904, with an S.D. of 11.5; for the social adjustment score, .908,
with an S.D. of 10.0.

* References: Tiegs, Clark, and Thorpe; R. Cattell, 1941 a; Symonds, 1941;
Vernon, 1941.
Evidently this test, in spite of its careful construction, is open
to the same questions which apply to the Bell Inventory. The
direct type of self-questioning is obviously open to distortion.
The concept of adjustment, and its particularization, upon which
the whole instrument stands or falls, is undeniably quite vague,
so that the basic validity of the test is open to grave doubt.
Moreover, the suggestions in the manual for diagnosis and treatment
cannot but arouse certain misgivings. The kind of amateur
ministrations which seem to be suggested can easily be dangerous.
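
The split-half reliabilities and standard deviations reported above for the California Test can be turned into a rough margin of error for a single score. The sketch below uses the standard classical-test-theory formula SEM = S.D. x sqrt(1 - r); that formula is not given in the text, and the function and variable names are illustrative assumptions.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical test theory: SEM = SD * sqrt(1 - reliability).

    This formula is a standard one supplied for illustration; the text
    itself reports only the reliabilities and standard deviations.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# (S.D., split-half reliability) as reported in the text for 558 cases.
reported = {
    "total": (19.9, 0.931),
    "self_adjustment": (11.5, 0.904),
    "social_adjustment": (10.0, 0.908),
}

sem = {
    name: standard_error_of_measurement(sd, r)
    for name, (sd, r) in reported.items()
}

for name, value in sem.items():
    print(f"{name}: SEM = {value:.2f} score points")
```

On these figures an individual total score carries an error of roughly five points either way, which is one way of seeing why even a high reliability does not settle the question of validity raised in the text.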


3. Personality Quotient Test (Link)* 


“Do you have stage fright? Do yi 
Do your classmates mak 
often brood over your m 
spells of dizziness?” Also, there are lists of games, studies, hobbies, 
etc., to be checked as liked 

score. It is designed to ra 
(1) Personality, which in 
tiative, i.e., “habits” of taking initiative with other people. (3) 
Self-determination, i.e., the doin 


RA 


rds the opposite sex. 
eported, which are not sufficient 


for individual diagnosis. Validity is not specified or investigated.


* Reference: Link. 


In fact, the author of the instrument (Link) attacks the
requirement of validation, which he asserts is of secondary importance.
The reason for this is that the instrument turns upon his conception
of personality as “the possession of habits which interest and
serve other people.” Link feels that if the test is logically
constructed to embody this concept, its external validation does not
matter.

In terms of this concept he proposes to interpret obtained scores
by converting them into “personality quotients” on the model of
the I.Q. This so-called “personality quotient” is obtained by finding
the difference between the subject’s obtained score and the
mean score for his age group, dividing the result by the standard
deviation of the group, and multiplying the result by 17. The
number 17 is chosen because it is the approximate standard deviation
of intelligence quotients as obtained by the Stanford-Binet
scale. Needless to say, this is a piece of pseudo science completely
irrelevant to the matter in hand. In fact the whole set of statistical
manipulations, and the resultant “quotient,” are completely
farcical.

This instrument has been presented to give the reader some idea
of how bad a thoroughly bad test in this area can be. Obviously
dubious answers to impudent questions are interpreted in terms of
a scheme of concepts which have no assignable meaning or
diagnostic significance. And then, presumably to give them a show of
authenticity with the public, they are distorted by irrelevant
statistical treatment into a score which has a sound analogous to
the most widely known of all psychometric measures, the very
name of which falsifies the method employed to calculate it. Tests
of personality turning upon empirical concepts are always open
to the suspicion of superficiality. But at least they can be
constructed in such a way that whatever meaning they may have is
reported in the scores in an honest and straightforward fashion. We
turn now to two personality tests whose basic concepts are drawn
from psychiatry rather than from empirical analysis and conjecture.
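
The arithmetic behind Link's "personality quotient," as described above, can be sketched in a few lines. The function name and the sample figures are hypothetical; only the three steps (difference from the age-group mean, division by the group standard deviation, multiplication by 17) come from the text, and since the text mentions no additive constant none is applied here.

```python
def personality_quotient(score: float, age_group_mean: float,
                         age_group_sd: float, multiplier: float = 17.0) -> float:
    """Follow the three steps described in the text, and nothing more."""
    if age_group_sd <= 0:
        raise ValueError("the group standard deviation must be positive")
    deviation = score - age_group_mean        # difference from the age-group mean
    standard_score = deviation / age_group_sd  # divide by the group S.D.
    return standard_score * multiplier         # multiply by 17


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
example = personality_quotient(score=62, age_group_mean=50, age_group_sd=8)
at_the_mean = personality_quotient(score=50, age_group_mean=50, age_group_sd=8)

print(example)      # a score 1.5 S.D. above the mean, multiplied by 17
print(at_the_mean)  # a score exactly at the mean
```

The result is simply a standard score rescaled by a constant, which makes plain that no quotient in the ordinary sense is ever computed.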


4. Personality Inventory (Bernreuter)* 

This is among the best known instruments in the field of
personality measurement. It consists of 125 items of the self-rating or
self-questionnaire type. The items were chosen in part from previous
personality tests, the basis of choice being their power to
differentiate between persons rated high and low on the contemplated
criteria, which will be explained below. Sample items are
the following: Does it make you uncomfortable to be unconventional
or “different”? Do you often feel miserable? Do you try to
get your way even if you have to fight for it?

* References: Bernreuter; Flanagan, 1935.

The basic technical novelty is to treat each item-response as
indicative of several different traits. In the original inventory 4
such traits were set up, but 2 more have been added due to the
work of Flanagan so that there are now 6. They are as follows.
(1) B1-N, Neurotic Tendency, i.e., the trait of emotional
instability. (2) B2-S, Self-Sufficiency, i.e., the trait of rarely
asking for sympathy, ignoring advice, liking to be alone. (3) B3-I,
Introversion-Extroversion, i.e., the trait of being imaginative,
living in oneself. (4) B4-D, Dominance-Submission, i.e., the trait of
dominating others in face-to-face relationships. (5) F1-C,
Self-Confidence, i.e., when high the trait of hampering
self-consciousness, when low the trait of wholesome self-confidence.
(6) F2-S, Sociability, i.e., the trait of being nonsocial,
indifferent, which is indicated by a high score here.

The operation of the scoring scheme may be gathered from
Figure 25. The scores of the three possible responses (yes, no,
doubtful) to one item are presented. The score values run from
+5 to -5. As will be seen, a positive answer to the question “Do
you like to bear responsibilities alone?” carries a score of +4 on
Self-Confidence, and of -1 on Neurotic Tendency. A negative
answer carries a score of -5 on Self-Confidence, and +2 on
Neurotic Tendency, and so on. Similar sixfold values are assigned
to all three possible answers to the 125 questions which make up
the inventory.

[Figure 25: score values of the responses Yes, No, and Doubtful to the
item “Do you like to bear responsibilities alone?” on the six scales
B1-N (Neurotic Tendency), B2-S (Self-Sufficiency), B3-I
(Introversion-Extroversion), B4-D (Dominance-Submission), F1-C
(Self-Confidence), and F2-S (Sociability).]

FIG. 25. SCORING OF ONE ITEM ON BERNREUTER INVENTORY ON SIX TRAITS
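
The idea of scoring one response on several traits at once can be sketched as a lookup of weights per item and per answer, summed scale by scale. In the sketch below only the four weights quoted in the running text for the item on bearing responsibilities alone are taken from the source; the second item, all of its weights, and every name in the code are hypothetical and serve only to show the summation.

```python
SCALES = ("B1-N", "B2-S", "B3-I", "B4-D", "F1-C", "F2-S")

# item -> response -> {scale: weight}. Scales not listed contribute 0.
WEIGHTS = {
    "bear_responsibilities_alone": {
        # These four values are the ones quoted in the running text.
        "yes": {"F1-C": 4, "B1-N": -1},
        "no": {"F1-C": -5, "B1-N": 2},
    },
    "hypothetical_item": {
        # Invented weights, for illustration only.
        "yes": {"B1-N": 3, "B4-D": -2},
        "no": {"B1-N": -3, "B4-D": 2},
        "doubtful": {"B1-N": 0, "B4-D": 0},
    },
}


def score_inventory(responses: dict) -> dict:
    """Add up, for each scale, the weights attached to the given answers."""
    totals = {scale: 0 for scale in SCALES}
    for item, answer in responses.items():
        for scale, weight in WEIGHTS[item][answer].items():
            totals[scale] += weight
    return totals


totals = score_inventory({
    "bear_responsibilities_alone": "yes",
    "hypothetical_item": "no",
})
print(totals)
```

Because every answer feeds several scales at once, the scale totals share the same raw material, which is relevant to the later criticism that the traits are not independent.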


The addition of the two “F” traits came about in the development
of the inventory. Bernreuter found that neurotic tendency
and introversion-extroversion correlated about .95, so that they
are virtually the same. Thus only 3 of the original 4 “B” ratings
were needed. Flanagan, by means of factor analysis, drew the
conclusion that two traits, which he named self-confidence and
sociability and characterized as above, were the chief components
of the first 4. So only the two “F” traits are necessary, although all
6 are retained.

Two simplified scoring plans for the inventory have been
published. The first of these (Kempfer) rates all values from +3 to
-3 as 0, 4 and above becomes 1, -4 and below becomes -1. It
is not suitable for accurate work, but useful in the rapid location
of extreme cases. The other plan (McClelland) counts all answers
weighted +3 or more, and subtracts all weighted -3 or less for
each trait scale. The resulting scores are reported as correlating
with the full scale from .95 to .84 for various traits. Both plans
greatly reduce time and labor.
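
The two simplified scoring plans can be compared on a single trait scale as follows. This is a minimal sketch under stated assumptions: the list of weights is invented, the function names are hypothetical, and the McClelland plan is read here as a count of answers weighted +3 or more minus a count of answers weighted -3 or less, which is one plausible reading of the description above.

```python
def kempfer_unit_weight(weight: int) -> int:
    """Kempfer plan: 4 and above becomes 1, -4 and below becomes -1, else 0."""
    if weight >= 4:
        return 1
    if weight <= -4:
        return -1
    return 0


def kempfer_score(weights: list) -> int:
    return sum(kempfer_unit_weight(w) for w in weights)


def mcclelland_score(weights: list) -> int:
    """Assumed reading: count of weights >= +3 minus count of weights <= -3."""
    plus = sum(1 for w in weights if w >= 3)
    minus = sum(1 for w in weights if w <= -3)
    return plus - minus


# Hypothetical weights earned by one subject's answers on one trait scale.
weights = [4, -1, 3, -5, 2, -3, 5, 0]

full_score = sum(weights)              # full weighted scoring
kempfer = kempfer_score(weights)       # only the extreme weights count
mcclelland = mcclelland_score(weights)

print(full_score, kempfer, mcclelland)
```

Both shortcuts discard the small weights, which is why they save labor and also why the first is described as suitable only for locating extreme cases.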

Lorge (1935 a, b) has made a highly critical analysis of the
Inventory. (a) He finds reliabilities of .88 for scores on Neurotic
Tendency, .80 for scores on Self-Sufficiency, .87 for scores on
Introversion-Extroversion, and .85 for scores on
Dominance-Submission. These are not sufficient for individual
diagnosis. Furthermore, he finds that the separate traits are not
self-consistent, i.e., that the item scores on the total of which the
rating on each trait is determined are inconsistent with one another.
He also finds that the traits are not independent. (b) He argues that
the classification of traits is not valid, and that they are mere
“fiat” traits, statistical artifacts, rather than authentic factors in
mental organization. In other words, the concepts on which the
instrument is built and in terms of which performance on it is
interpreted are not authentic, according to Lorge. (c) As to
validation, in general, he points out that the evidence by which this
is established is the agreement of the Inventory with other
personality tests. But, as he remarks, it was built upon them in the
first place, and independent clinical validation is essential.
This last point in particular has been taken up by other
investigators. Landis and Katz, and Landis, Zubin and Katz (q.v.) show
that the Inventory cannot discriminate between college students
and hospitalized psychotics. A high score on the various traits
seems to indicate poor adjustment, but a low score does not prove
the opposite, chiefly, as they believe, because of dishonest
responses. That is to say, if a subject honestly makes the responses
to the items which lead to high total scores on the traits, he will
reveal his maladjustment. But he can conceal it readily by making
answers which common sense indicates as “acceptable” or
“creditable.”

Newcomb (q.v.) again in a very interesting field study dealt
with the problem somewhat differently. His subjects were groups
of normal individuals living in a camp. Daily records of their
behavior were made by the camp counselors. Thirty items indicative
of introversion were set up, such as self-confidence,
responsibility, etc., and the actual i

n these items. Behavior was found to 
have little consistence, i.e., a trait did not 


gs of these individuals, No evidence was 
discovered for an intr 


P to psychological 
- Nor can the matter be im- 


5. Humm-Wadsworth Temperament Scale *


This is a profile scale, g
of them yield scores, and x
duced to create a “test atm
the scored questions. The items are scored in a manner analogous
to the Bernreuter procedure, on 7 personality types as follows.
(1) N, Normal, characterized by self-control, self-improvement,
and inhibition. (2) H, Hysteroid, characterized by tendencies to
self-preservation, selfishness, and crime. (3) M, Manic Cycloid,
characterized by elation, excitement, and sociability. (4) D,
Depressive Cycloid, characterized by sadness, retardation, caution,
and worry. (5) A, Autistic Schizoid, characterized by shyness and
sensitiveness. (6) E, Epileptoid, characterized by ecstasy,
meticulousness, and inspiration. (7) P, Paranoid Schizoid,
characterized by fixed ideas, restiveness, and conceit. Each
significant response is rated from 1 to 5 on each of these 7 types,
and the ratings are added to make 7 scores, which are then rated from
very strong to very weak.

* References: Humm and Wadsworth; Humm; Mosier, 1938.

A commendable feature of this instrument is its use of what is
probably about as sound a classification of personality types as can
be had, derived from psychiatric theory and practice. This
is in contrast to the somewhat vague and ill-defined categories
of the Bernreuter Inventory. The items have been selected
because of their power to differentiate persons known to be high
in one or other of the type categories. It has been found helpful
in personnel work, for, of the 2,000 cases chosen as being
satisfactorily adjusted on the basis of their test showings, very few
were discharged for personality reasons. Poole (q.v.), the medical
director of the Lockheed plant, where the test has been quite widely
used, presents a favorable account, though without going into
detail.

The instrument is not easy to use, and requires skill and judgment
for a proper interpretation and application of its results.


6. Minnesota Multiphasic Personality Inventory * 

The Minnesota Multiphasic Personality Inventory consists of
550 items, each in the form of a simple declarative statement in the
first person singular, to which responses of true, false, and
“cannot say” are to be made. Instances are: “I seldom worry about
my health,” “My daily life is full of things that keep me
interested,” “I sometimes feel like swearing.” The basic assumption is
that the item-responses, when grouped, will form numerous
scales. Scales have been developed for hypochondriasis,
depression, psychopathic deviate, psychasthenia, hypomania, hysteria,
introvert, schizophrenia, paranoiac. Also the inventory yields a
question score, i.e., a score on the “cannot say” responses, a lie
score, and a validity, or F score. The lie score expresses the
tendency to falsify for the sake of making socially approved
responses. This is provided by responses to 15 items which are in
effect catch questions, such as “I get angry sometimes.” The F score
is the

* References: Hathaway and McKinley (all entries); McKinley and Hathaway
(all entries); Manual; Supplementary Manual.

“symptomatic depression is
recognizable frame of mind characterized
by a poor morale, lack of hope in the future, and dissatisfaction
atus generally” (Hathaway and McKinley,
1942, p. 74). The construction key for each scale, i.e., the


appeared, the usual pro- 

5 7 . n- 
other against clinical diagnosis. Meehl (an ieee 
the records of 147 hospitalized psychotics, 
efore testing, Of the two-thirds of the 
re identified, about two-thirds were placed 
€gory, which is much better than chance- 
(9.0.) find Considerable variation in the 
Be the various Scales on the basis of clinical 
eee on deviate, Paranoiac, and schizophrenic 
ersion £ meas For yPochondriasis, depression, 
(masculinity-femininity), and psychasthenia
there was none. Leverenz (q.v.) makes the point that even though
the scores do not always corroborate or agree with the clinical
impression, the use of the inventory often directs the clinical
investigation into new and fruitful channels, and makes the clinician
aware of problems that might otherwise be overlooked. Morris
(q.v.), on the other hand, reporting on the use of the inventory
with 320 naval personnel, finds it unsatisfactory when clinical
evaluations are used as a criterion. It differentiated borderline
normals from serious psychopaths, but did not aid in differential
diagnosis among psychopathological groups. His conclusion is that
the inventory “at its present stage of development . . . cannot be
regarded as a practical clinical tool, the results of which can be
accepted as valuable diagnostic aids to the psychiatric member of
the clinical team” (p. 374).


7. Detroit Scale for the Diagnosis of Behavior Problems * 

This is a rating scale to be used by a trained examiner. It
consists of 66 items under 5 headings as follows. (1) Health and
physical factors. (2) Personal habits and recreational factors.
(4) Parental and physical fac- 


Angs on the 66 items under these 5 he : 
direct observation, scrutiny of medical and educational records, 
and questioning of the child and his p: 

5-point scale (1 very poor, 2 poor, 3 fair or average, 4 good, 5 very
good). The rating values are described in detail for each item.
Thus item 18 is “Home duties.” The question to the child is:
“What regular jobs do you have to do to help around home?” The
questions to the parents are: “What regular jobs does he have
around the house? Does he do them without urging?” The item is
scored as follows: score 5 for a reasonable number of duties done
regularly and willingly; score 4 if he has to work most of the time,
fairly willingly; score 3 if he has some duties but not regular ones,
with little planning or organization; score 2 if he is forced to work
most of the time, with no time for recreation; score 1 if he has no
duties, and is encouraged to despise any work, or if he rebels and
absolutely will not accept any duties.
In commenting on this item, the authors (Baker and Traphagen)
point out that the child untrained to work in the home is
apt to be much more immature than others, since work at home


* Reference: Baker and Traphagen. 


Thus the instrument implements a controlled personal survey of
behavior problem cases, oriented toward their most characteristic
causes. A summary sheet shows the rating on each item, the number of
items rated, and the total score obtained by adding all credits. This
latter can be transmuted into a letter grade,


isa 
a Tropsists of 2 schedules. Schedule tel, 
behavior Problem record. It lists 15 Such problems, e.g., defiance 
discipline, speech difficulti 


list of 35 traits divided into 4 groups (intellectual, physical,
social, and emotional), each of which is to be rated on a 5-point
scale. A re-rating correlati


der Consideration. 
9. Logical Decision Test †


which one would have to make a decision, e.g., finding money or
property in a public place and deciding whether to leave it, keep it,
seek to return it to the owner, etc. The subject is asked to consider
carefully and to give reasons for his decision. He is assured that no
adverse judgments will be passed, no matter what he decides or what
reasons he gives. Answers are classified in terms of 6 kinds of
goals: self-regard, parental approval, friends’ approval, general
welfare, objective or practical considerations, social institutions.
There has been found to be considerable hedging and evasion connected
with unethical choices. The test is not of major importance, but
embodies a not unpromising diagnostic technique.

* Reference: Wickman.
† Reference: Brandt.
By way of a brief comprehensive appraisal of personality tests,
of which a very large number exist, it may be said that in the great
majority of cases they are at best experimental. Ellis (q.v.), who
has made an extensive and thorough analysis of the literature,
finds that while reliabilities as reported are “notoriously” high,
dealing with the valida- 


tion of group personality tests, 80 reported positive results, 44 were 
i i ive. For individual per- 


n , i 
ventory, out of x5 validation stud 


aa, and 2 were negative—a str 
a a findings for five persona 
There has been some criticism of Ellis’ investigation, but it
seems probable that its general outcomes, at any rate, are
defensible. In any case they are confirmed by a somewhat similar
though less extensive investigation by Traxler (1946), who
concludes that “nearly all reputable personality testing outside
carefully controlled clinical situations is still frankly tentative and
experimental” (p. 424). And Kornhauser (1945 b), in his poll of
the opinions of psychologists, reports that of 79 who replied, 1.5%
found personality inventories highly satisfactory, 13.5% found
them moderately satisfactory, and
Ugh rather unsatisfactory, to highly , i; 
] of questionnaire responses, 
ne of the few studies 
made on this topic. They gave from the Bell Inven- 
tory and varios “Teraeston attitude scales to 132 students in social 
in all at intervals. Seventy- 


Studj ° 
aS dies courses, giving them three times i 2 i 
© per cent of the responses were consistent. Interestingly,

TABLE 29

VALIDITY STUDIES OF PERSONALITY QUESTIONNAIRES
(Ellis, p. 424)

                                    Number    Number of    Number of     Number of
                                    of Times  Positive     Questionably  Negative
Test Employed                       Employed  Validations  Positive      Validations
                                                           Validations

Bell Adjustment Inventory              12         1             0            11
Bernreuter Personality Inventory       29         9             6            14
Thurstone Personality Schedule         10         4             1             5
Woodworth Personal Data Sheet          29        11             4            14
Other Personality Tests                82        40            15            27
Total                                 162        65            26            71

responses to factual questions were less consistent than those in- 
volving attitudes and evaluations. 
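
The counts in Table 29 lend themselves to a quick tally. The sketch below assumes the table reads as entered here, with each row's three outcome counts adding up to its number of uses; the variable names are illustrative, and the proportions are simple arithmetic on Ellis's counts rather than figures given in the text.

```python
# (times employed, positive, questionably positive, negative), assumed
# reading of Table 29 (Ellis).
TABLE_29 = {
    "Bell Adjustment Inventory": (12, 1, 0, 11),
    "Bernreuter Personality Inventory": (29, 9, 6, 14),
    "Thurstone Personality Schedule": (10, 4, 1, 5),
    "Woodworth Personal Data Sheet": (29, 11, 4, 14),
    "Other Personality Tests": (82, 40, 15, 27),
}

# Consistency check: outcomes in each row should account for every use.
for name, (times, positive, questionable, negative) in TABLE_29.items():
    if times != positive + questionable + negative:
        raise ValueError(f"row does not add up: {name}")

# Column totals, in the same order as the row tuples.
totals = tuple(sum(column) for column in zip(*TABLE_29.values()))

# Share of validation studies with a clearly positive outcome.
positive_share = {name: row[1] / row[0] for name, row in TABLE_29.items()}
overall_positive_share = totals[1] / totals[0]

for name, share in positive_share.items():
    print(f"{name}: {share:.0%} positive")
print(f"All tests: {overall_positive_share:.0%} positive")
```

On this reading, well under half of the validation attempts came out clearly positive, which is the quantitative basis for calling such tests at best experimental.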


MEASURES OF INTEREST


The measurement of interest has proved a decidedly more
manageable problem than the measurement of personality. A goodly
number of successful instruments for this purpose have been
developed. This is because the basic concept itself is much more
clearly definable. An interest may be described as a tendency to
make consistent choices in a certain direction without external
pressure and in the face of alternatives. Interests as so understood
are, within limits, observable. Moreover, an individual can make
reasonably accurate and dependable verbal reports about his
interests, which to him mean his preferences.

Furthermore, a good deal of well-oriented and sequential
investigation has been devoted to the topic of interest through the
years. This has made possible the construction of interest tests
and scales with a real psychological content and a constructive
and intelligible psychological meaning. The main points which
have come to light may be summarized as follows:


1. Interest and success 


The relationship between interest in any line of activity, as
expressed either in words or actions, and success in that activity
is obviously of great importance. In order to understand what
that relationship is, a twofold distinction must be made.

A. The relationship between interest in any activity and objective
success in it as compared to other people is doubtful. This is
particularly true when expressions of differential interest in rather
similar lines of activity are elicited; as, for instance, the degree
of interest a person feels in school courses of academic type (v.
Bridges and Dollinger). Such expressions of differential interest
seem to have little relationship to relative success. The
relationship becomes closer if a wider range of school courses is
considered, including manual training, art, music, and so forth, as
well as academic studies (v. Thorndike, 1921). If the range of
differences between preferred activities is still further extended,
and there is choice, let us say, between intellectual and social
doings, then preference begins to be of some significance in
predicting success (Wyman). This finding has been confirmed in a
number of places.

B. If the pattern of a person’s expressed interests is compared
with the pattern of his abilities within himself rather than
with his achievement with reference to other persons, then the
relationship is found to be high (v. Thorndike, 1917; King and
Adelstein). This is perfectly understandable. One may be greatly
interested in some line of activity and yet be excelled by many
other and still more capable persons. But the existence of the
interest may still indicate one’s own best capabilities.

These early results seem in general consistent with the extensive
and extremely important work of Strong. He regards interest as
what he calls an “indeterminate” indicator of success. That is to
say, interest tends to be associated with success, but not directly,
since both are affected by many other factors. However, in the
case where interest continues over a lapse of time, the relationship
is closer (Strong, 1943). However, as Carter (1944) points out in
his summary of ten years of work in this field, the criteria of
success in any vocation are not adequate, and this makes the
validation of scales and tests for the measurement of interest and
the detection of interest patterns difficult in terms of success. The
relationship of interest to satisfaction seems closer and more
determinate than its relationship to success.


2. The permanence of interests 


This latter relationship between interest and ability or success 
is greatly strengthened if the interest in question is of long stand- 
ing, as, for instance, if it can be shown to exist through the 
elementary school, the junior high school and the senior high 
school periods, and into adult life (Thorndike, 1917). Moreover,
it has been shown that interest patterns tend to grow more stable 
as life advances. Certain youthful interests, to be sure, tend to 
fade out, and others take their place. Such doings as active out- 
side amusement or the reading of fiction are apt to lose their 
appeal, and quieter occupations to be substituted (Thorndike, 
1935). Yet Strong (1931) has shown that in general the things 
most liked at the age of twenty-five are liked more some decades 
later on, and the things least liked at twenty-five are less liked 
later on. So again in 1943 he reports that the interest patterns re- 
vealed by interest scales are highly permanent, and little influenced 
by training and experience in the occupations concerned. Carter, 
too, finds that the vocational interests of high school pupils are 
incompletely developed, but highly individual, definitely pat- 
terned, and “much more reliable and permanent than earlier 
studies would indicate” (1944, p. 68). The inferences for the
practical problem of measuring interest are obvious. With very young
subjects the significance of any such ratings is dubious. But with 
older persons they may very well indicate a permanent life trend 
significantly related to the individual ability pattern. 


3. Interest groups 


The findings reported so far establish nothing more than that 
a reliable interest rating, if it can be obtained, is likely to have 
considerable significance, and that if feasible psychometric instru- 
ments can be devised in this area, they will be well worth while. 
But the decisive point for the measurement and evaluation of
interests has been the establishment of the fact that there are
fairly well-defined interest groups. These are groups of persons 
fairly homogeneous in interest, and differing significantly in this 
respect from other persons. 

Lewis and McGehee (q.v.), for example, have shown that there
are significant differences in interest as between bright and dull
children. The comparative interest patterns of these two groups
are shown in Table 30. The data repay careful scrutiny. It will be
seen that the high differentiations occur with reading, sports,
playing musical instruments, dramatics, and collecting, and that
others are definitely significant, and also that the superior
subjects had many more hobbies than the retarded.

The establishment and analysis of differential interest groups
has been carried to a considerable length, most notably by Strong
(q.v.). Thus, persons in each of a large number of occupations and
families of occupations are known to exhibit characteristic
interest patterns. A young man is likely to enjoy an occupation when
his interests harmonize with those of adult workers in it. Moreover,
the successful persons in occupations exhibit the characteristic
interest patterns with peculiar definiteness. Again, it is known
that persons in different educational curricula tend to exhibit
differential interest patterns, though these are not so clear-cut as
those in different occupations. It may be that different social
groups also exhibit differential interest patterns, but so far this has
not been determined (v. Fryer, 1932; Strong, 193?, 1943).

Strong (1943, v. ch. 8) has brought to bear the techniques of
factor analysis upon the study of interest. He has isolated four or
five factors, and on this basis has established eleven groups of
men’s occupations, and ten groups of women’s occupations.


INSTRUMENTS OF MEASUREMENT 


All this work clearly opens the way for the development of
instruments of measurement. A scale can be devised and
standardized in such a way that it will not merely elicit whatever
preferential interests a person may happen to have, but will show
their relationship to the characteristic interest pattern of this or
that occupational or educational group. When this is properly
done, the result is an instrument of very considerable value for
guidance and appraisal. Once we recognize that an individual’s
established interest pattern is related to the pattern of his own
abilities, and furthermore that successful persons in various
functional groups exhibit characteristic interest patterns, it is
manifest that such a scale permits very significant interpretations
and prognostications. As Bingham (q.v.) remarks, interests properly
and frankly formulated in a situation where the subject is not
affected by considerations of what he ought to say to make a
good impression or by ideas about prestige, are a highly
meaningful sign. Such, then, is the basis on which the measurement of
interest proceeds. We now turn to a discussion of some
representative instruments.


TABLE 30

PERCENTAGES OF SUPERIOR AND RETARDED BOYS AND GIRLS DESIGNATED
AS INTERESTED IN VARIOUS HOBBIES

(Quoted from Lewis and McGehee, Table 2, p. 598)

                                      BOYS                GIRLS
HOBBIES                        Superior  Retarded   Superior  Retarded

Reading novels                    50        23         60        31
Reading history and science       31         9         22         9
Reading funny papers              49        39         50        41
Active games and sports           67        54         ?2        38
Quiet games                       26        15         29        24
Playing musical instruments       22        10         28        11
Listening to the radio            39        30         37        29
Sewing, knitting                   3         4         36        34
Housework                          7         5         32        40
Going to shows                    33        30         34        29
Dramatics, participation           7         4         16         6
Make-believe games                 9         6         24        16
Religious activities              17        11         21        15
Building things, shopwork         34        27          4         3
Traveling                         13         8         1?         8
Driving car                        7         9          3         3
Studying                           9         4         11         6
Working, farm, store              10        17          3         5
Clubs: social, dance               4         2          9         6
Scouting or other serious
  club activity                   13         6         10         ?
Collecting                        30         9         22        1?
None                               1        14          1        12
Numbers                         1700      5009       2505      1618


1. Interest Questionnaire for High School Students *


This questionnaire consists of items to which the subject
responds by indicating liking, indifference, or dislike. It falls into
8 sections. There are 68 items which have to do with occupations
which the subject likes, dislikes, or regards with indifference; 24
having to do with activities; 20 having to do with school subjects;
20 having to do with job activities; 41 having to do with school
activities; 12 having to do with prominent men; 26 having to do
with things to own; 23 having to do with magazines. The items
were selected in terms of their ability to differentiate the interest
patterns of students in academic, commercial, and technical
courses. There are 3 keys which score the subject’s responses on
the basis of similarity to these three interest patterns. It is
reported that the questionnaire can predict success in the curriculum
of the subject’s choice more accurately than it is predicted by a
general intelligence test. The instrument is carefully and
competently constructed, and is a good sample instance of others of
the same general type.


2. Vocational Interest Blank for Men (Strong) †


This is by all means the most important and highly developed
of all instruments for the measurement of interest. It has been
revised and extended from time to time and has been widely used.
A second revision, now available, is the outcome of twenty years
of work and experience.

The blank consists of a lengthy and elaborate questionnaire.
It lists 100 occupations, 38 amusements, 36 school subjects, and
contains 46 items having to do with people. To these the subject
responds by indicating liking (L), indifference (I), or dislike (D).
In addition, the blank calls for self-ratings on various preferences,
habits, and traits which were selected because they differentiated
between the interests of a large variety of occupational personnel.
Its outstanding feature is its varied scoring. In this respect it
resembles the Interest Questionnaire for High School Students just
described, but carries the principle much further. Norms for the
various scoring schemes were set up based on the declared
interests of persons successful in various occupations and of those
of a large group of “men in general,” this sample being chosen to
conform to the occupational distribution of the United States
Census. Scoring keys have been worked out based on the interest
patterns of 35 occupations, 6 groups of occupations which are
psychologically similar, and for 3 nonoccupational traits, namely
maturity of interest, masculinity-femininity, and studiousness, and
also for occupational level. Thus the item responses made by any
person can be interpreted in terms of their relationship to a large
variety of significant interest patterns.

* Reference: Symonds, 1930.
† Reference: Strong, 1943.


                                 PREFERENCE RATING
ITEM                              L      I      D     INTEREST GROUP OR TRAIT

Electrical engineer               2     -5      5     Advertiser
                                  4     ..     -3     Masculinity-femininity
Displaying merchandise in store   ?     -2      1     Advertiser
                                  2      1      1     Masculinity-femininity
Writing reports                   2     -1     -1     Personnel manager
                                  ?     -1     -1     Accountant


FIG. 26. SAMPLE RESPONSES TO VOCATIONAL INTEREST BLANK FOR
MEN INTERPRETED FOR VARIOUS OCCUPATIONS AND TRAITS


How the scoring scheme works out can perhaps best be understood
from an examination of a concrete sample such as that
presented in Figure 26. An expressed liking for the occupation of
electrical engineer gives a positive score of 2 in terms of the
interest pattern characteristic of advertisers, and a positive score
of 4 in terms of the interest pattern associated with the trait of
masculinity-femininity. An expressed indifference for this
occupation gives a score of -5 on the norm for advertisers, and is not
related to the trait. An expressed dislike for this occupation gives
a positive score of 5 on the norm for advertisers, and a score of
-3 on the trait. A person’s total interest score is the sum of the
positive and negative values on all items, as rated in terms of the
occupation, or occupation group, or trait concerned. A person can
have as many total interest scores as there are norms and keys
for the Blank. At present this number is 44.
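
The summing of signed item weights just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table layout and all names (WEIGHTS, total_interest_score, score_profile) are hypothetical, and the only weights shown are those the text gives for the item "Electrical engineer"; a response that is "not related" to a key is treated as contributing zero.

```python
# Hypothetical weight table. Only the "Electrical engineer" values are taken
# from the text; the dictionary layout itself is an illustrative assumption.
WEIGHTS = {
    "Electrical engineer": {
        "Advertiser": {"L": 2, "I": -5, "D": 5},
        # Indifference is "not related" to this trait, so it is entered as 0.
        "Masculinity-femininity": {"L": 4, "I": 0, "D": -3},
    },
}


def total_interest_score(responses, key, weights=WEIGHTS):
    """Sum the positive and negative weights that one key assigns to a
    person's responses. `responses` maps item name -> "L", "I", or "D"."""
    total = 0
    for item, rating in responses.items():
        # An item with no entry under this key contributes nothing.
        total += weights.get(item, {}).get(key, {}).get(rating, 0)
    return total


def score_profile(responses, keys, weights=WEIGHTS):
    """One total interest score per norm or key, as the text describes."""
    return {key: total_interest_score(responses, key, weights) for key in keys}


if __name__ == "__main__":
    person = {"Electrical engineer": "D"}
    print(score_profile(person, ["Advertiser", "Masculinity-femininity"]))
```

With a complete weight table, the same person would receive one total for every key, which is how a single set of responses can yield as many scores as there are norms.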
A split-half reliability of .87 has been reported for 285 students
at Stanford University, and also a retest reliability of .869 after
the interval of a week. The basic claims as to validity, which can
be substantiated, may be summarized as follows. (a) The Blank
discriminates sharply between those who are successful in a given
category, and “men in general.” Thus in one investigation it was
found that only 15% of nonengineers rated A in engineering
interest (i.e., in conformity with the interest pattern of successful
engineers). (b) Interest scores and patterns correspond well with
vocational success. Thus of 181 life insurance salesmen who rated
high to medium in this interest category, 67% wrote at least
$150,000 worth of insurance a year. (c) Personnel experience
amply validates the Blank. It sometimes misfires, and sometimes
it is resented and disliked. But in general it is an excellent index,
particularly when combined with other criteria.


3. Vocational Interest Blank for Women (Strong) *


The same techniques of construction, and the same general
organization are embodied here as in the Vocational Interest
Blank for Men. Norms and keys are available for 17 occupations,
and for the trait of masculinity-femininity. Some special
difficulties were encountered in connection with this instrument.
Women tend, to a greater extent than men, to enter and stay in
various occupations for reasons other than interest. Hence the
relationship of the interest pattern to success is not so clear.
Again, the Blank was standardized on mature women, and this
casts some doubt on its use for younger women, as the norms
developed may not apply well. Finally, the occupations used in
the scaling were not homogeneous in all cases.


4. Occupational Orientation Inquiry †


This and the example that follows are cited as samples of a
different type of instrument from the foregoing, or at least of
instruments whose primary serviceability is different. The
Occupational Orientation Inquiry calls upon the student to give a
brief account of his vocational interests and experiences, and
then to make a self-rating on 224 occupations in terms of his
knowledge of each, his interest in it, his ability in it, and his
chance of placement in it. These ratings are made on a 5-point
scale. As a means of presenting in summary and comprehensive
form a young person’s broad vocational orientation, it is valuable.
Many similar questionnaires have been prepared, but the
particular advantage of this one is that it calls for self-ratings under
the four heads of knowledge, interest, ability, and opportunity.
This makes it of distinct service for guidance, as may be inferred
from one report to the effect that 78% of a large group of high
school subjects had virtually no vocational knowledge but hoped
to be able to find employment. The authors have attempted to set
up norms based on interest groups, but these do not appear to be
well founded. Thus the Inquiry cannot safely be used for
differentiation on the model of the Interest Blank for Men, but it
can be and is a serviceable tool for the counselor.

* References: Berman, Darley, and Paterson; Bingham; Strong, 1943.
† References: Wellar; Peatman.


5. Miner’s Analysis of Work Interests * 


This is an old, but still valuable, and indeed excellent
instrument. It consists of a four-page folder containing numerous
questions pertaining to vocational interests, and the subject is invited
to reflect about them carefully before he answers. Its distinctive
feature is its emphasis upon reflection, and the explicit intention
that it should be used as a preliminary to a conference with a
counselor.

These last two instruments, though practically valuable, are
not of any wide psychometric interest. They are included here
chiefly to give the reader an idea of the very numerous similar
examples that are available.


So in general, in the measurement of interest we have one of the
most successful fields of psychometric endeavor. The reason for
this is evident. It is the possibility of formulating clear-cut and
meaningful concepts, and of translating them into items which
are at once viable and significant.

6. Kuder Preference Record †


The Kuder Preference Record consists of 14 sets of 3-choice
items. The subject is instructed to indicate which he likes least
and most by punching holes in the appropriate positions. An
instance of such an item is: “Visit an art gallery: Browse in a
library: Visit a museum.” This is a modification of the earlier
plan, under which the subject was instructed to indicate a
preference between two situations. Kuder (1939) found that such
expressions of preference had sufficient stability to be used as test
items. There is no time limit, but the time required is usually
about 40 minutes.

* References: Bingham; Miner.
† References: Kuder, 1946; Super.

The scores on the record are classified in terms of the following
nine areas: mechanical, computational, scientific, persuasive,
artistic, literary, musical, social service, and clerical. Lists of
occupations are presented under 89 combinations of the various
areas. To show the relation between occupations and areas, the
mean scores of two occupations in the nine areas are presented
in Table 31.

Mean profiles such as those shown in Table 31 are provided
for a number of occupations. It has been remarked that the
numbers involved in determining them are quite small, but the
occupations differentiate according to expectation. Thus, accountants are
high in the computational area, and actors in the artistic, literary,
and musical areas. Preference profiles have also been found
consistent with college curricula. There is a significant, but not high,
relationship with school achievement. The highest correlation
reported is .419, for the science scale (area) with general science
achievement for women. There seems to be little relationship
between the Kuder scales or areas and the primary mental abilities
revealed in the test by Thurstone.

It has sometimes been said that the Kuder scales were developed
a priori, i.e., that the items indicating scientific, computational, or
persuasive preferences were designated largely by speculation and
inspection. However, the manual shows that they are based on
internal statistical consistency and mutual independence. As
compared with the Strong Interest Blanks, the Preference Record
is based on the vocational activity type. The Kuder Preference
Record succeeds better than the Strong Interest Blank for Women in
differentiating the vocational interests of women. It has been
shown to be reliable enough for counseling. There is, moreover,
a fair agreement between Strong and Kuder results (v. Triggs).
However, the two instruments are sufficiently divergent in method
and purpose so that one cannot be used to validate the other, nor
is either one a substitute for the other (Super). The Preference
Record has been found difficult to use with 9th graders, due to
lack of comprehension of the language employed (Christensen). A
number of studies have appeared reporting normally expected
relationships between occupations and areas, but such
relationships do not always appear (v. Bolanovich and Goodman).
J. A. Lewis (q.v.), using as his subjects male insurance agents and
female social workers, investigated the relationship between the
Preference Record and the Minnesota Multiphasic Personality
Inventory. He reports that those relatively uninterested in their
occupations tend to make more abnormal MMPI scores.


MEASURES OF ATTITUDE

The concept of attitude has been understood in at least two
somewhat different senses, both of which have had an influence
upon the construction of psychometric instruments. According to
Thurstone and Chave (q.v., pp. 6-7), an attitude is “the sum
total of a man’s inclinations and feelings, prejudice or bias,
preconceived notions, ideas, fears, threats, and convictions about any
specific topic.” On the other hand, it is often considered and
treated as an underlying disposition towards overt action, a
pervasive orientation towards life (e.g., Vernon and Allport).
Numerous attempts have been made to set up scales in terms of both
concepts, and it would not be hard to find intermediate or
compromise examples.


1. Specific Attitude Scales


Scales intended to measure or register attitudes on some specific
topic are constructed by a number of different procedures.

1. First, there is the so-called method of equal-appearing
intervals. This is often called the “Thurstone” method, but
the designation is not strictly accurate, although Thurstone and his
associates have been prominent in utilizing it (v. Thurstone
and Chave).

A scale of this type consists of a number of statements
designed to embody and reveal the presence of the attitude in
question in varying degrees. For instance, a scale for the
measurement of attitude towards the church may contain such statements
as the following: “I would rather go to church than do anything
else”; “I like to go to church”; “Going to church does no harm”;
“Going to church bores me”; “I would never set foot inside a
church.” Clearly these statements indicate a range from extreme
preference to extreme rejection on the specific matter of
church-going. The arrangement of such statements so that score values
can be attached to them then becomes the manifest problem. This
is done by submitting them to a jury, the members of which
proceed to rank or rate them in order according to some simple plan
of work. In the instance just given, the statement “I would rather
go to church than do anything else” ranks in positive value above
the statement “I like to go to church,” and thus receives a higher
positive score. The scale value or score value of each item or
statement is the central tendency of the jury ratings. Since there
is rarely complete agreement among the jury as to the indicative
significance of such statements, the value of each is to some extent
ambiguous, and the ambiguity of the item is measured by the
spread or dispersion of the ratings. Also, it often happens that
persons will accept a given statement, and in addition accept some
other statement far removed from it in scale value. This involves
an element of inconsistency, and the degree of inconsistency found
in any given item is measured by the manifested tendency for
such inconsistent ratings to appear. On this basis the statements
are selected, and then arranged in a linear order to constitute the
scale. Any person’s attitude towards the topic in question, all the
way from high positive to high negative, is determined and given
a numerical score by his checking what scale statements are
acceptable and unacceptable to him. The score value of the
statements, it should be explained, is calculated in terms of standard
deviations of the dispersion of jury ratings. Thus a statement which
secures a mean jury rating of 2.5 S.D. above the mean for all
statements would be given a high standard score, and so for all
other score determinations, mutatis mutandis.
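
The jury procedure just described can be illustrated with a short Python sketch. It rests on several assumptions not fixed by the text: the median is taken as the "central tendency," the interquartile range as the "spread or dispersion," jurors are supposed to rate on an 11-point scale, and a respondent's score is taken as the mean scale value of the statements he accepts. The function names and the ratings are hypothetical, and the conversion to standard-deviation units mentioned above is omitted.

```python
from statistics import median, quantiles


def scale_value(jury_ratings):
    """Scale value of one statement: central tendency of the jury ratings
    (the median is assumed here)."""
    return median(jury_ratings)


def ambiguity(jury_ratings):
    """Ambiguity of one statement: spread of the jury ratings
    (the interquartile range is assumed here)."""
    q1, _, q3 = quantiles(jury_ratings, n=4)
    return q3 - q1


def attitude_score(accepted, scale_values):
    """Score of one respondent: mean scale value of the statements he
    checks as acceptable (an assumed scoring rule)."""
    return sum(scale_values[s] for s in accepted) / len(accepted)


if __name__ == "__main__":
    # Invented ratings by a five-member jury on an assumed 11-point scale.
    jury = {
        "I would rather go to church than do anything else": [11, 10, 11, 10, 11],
        "I like to go to church": [8, 9, 8, 7, 8],
        "Going to church does no harm": [6, 5, 7, 4, 6],
    }
    values = {s: scale_value(r) for s, r in jury.items()}
    spreads = {s: ambiguity(r) for s, r in jury.items()}
    print(values)
    print(spreads)
    print(attitude_score(["I like to go to church",
                          "Going to church does no harm"], values))
```

Statements with a large spread would be the natural candidates for rejection when the final scale is assembled.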

2. Another method of constructing attitude scales is known as
the method of summated ratings. Just as the technique of
equal-appearing intervals has been especially associated with Thurstone,
so the latter procedure is sometimes designated as the “Likert”
method. In a scale of this type, a number of item-statements
dealing with the topic are presented, and the person responding is
instructed to check them in terms of five choices (strongly approve,
approve, undecided, disapprove, strongly disapprove). The
advantages of this method, according to Likert (q.v.), are that it does
away with the need for a jury of raters such as that used by
Thurstone, it makes scale construction much less laborious, and
produces a scale which has as good a reliability as that of the
previous type. Various studies have appeared which seem to indicate
that a scale using summated ratings is no less reliable and valid
than the “Thurstone type” scale (v. L. W. Ferguson; Farnsworth;
Riker). In fact, the idea of being able to locate a statement
at a designated point (usually the midpoint) of a continuous
linear series has seemed difficult and questionable to some
investigators.
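
A minimal Python sketch of the method of summated ratings follows. The numerical values attached to the five choices (5 down to 1) and the reversal of unfavorably worded statements are conventional assumptions rather than details given in the text, and the names CHOICE_VALUES and summated_score are hypothetical.

```python
# Assumed numerical values for the five response choices.
CHOICE_VALUES = {
    "strongly approve": 5,
    "approve": 4,
    "undecided": 3,
    "disapprove": 2,
    "strongly disapprove": 1,
}


def summated_score(responses, reversed_items=frozenset()):
    """Add up the values of a person's checked choices.

    `responses` maps each item-statement to the choice checked.
    Items named in `reversed_items` are worded unfavorably to the topic,
    so their values are reversed (5 becomes 1, 4 becomes 2, and so on).
    """
    total = 0
    for item, choice in responses.items():
        value = CHOICE_VALUES[choice]
        if item in reversed_items:
            value = 6 - value
        total += value
    return total


if __name__ == "__main__":
    answers = {
        "statement favorable to the topic": "strongly approve",
        "statement unfavorable to the topic": "disapprove",
        "another favorable statement": "undecided",
    }
    print(summated_score(
        answers, reversed_items={"statement unfavorable to the topic"}))
```

No jury is needed at any point, which is the saving in labor that Likert claims for the method.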
3. The work of Guttman (1947) represents a further refinement
in attitude scale construction and evaluation. He introduces a
criterion of scalability, on the basis of which it is found whether
and to what extent we can reproduce a person’s response to all
items from his response to one. For instance, if on one scale item
60% agree, 10% are undecided, and 30% disagree, then the highest
60% of individuals classified on total scores must be those who
register “agree” on this item, or there is imperfect scalability. This
leads to a technique for combining and sifting items to improve
the scalability of the instrument, which also makes possible a
reduction in its length. Some other investigators have found
Guttman’s technique difficult if not impractical. A further important
contribution of Guttman has been to recognize and provide means
for measuring the intensity with which an attitude is maintained
by a subject.
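
The check described in the example above (the highest 60% on total score should be exactly those who register "agree") can be sketched in Python as follows. This is an illustrative reading of the criterion for a single item, not Guttman's full procedure: the function names are hypothetical, the proportion of correctly reproduced responses is used as a simple index, and tied total scores are broken by order of appearance, which is an arbitrary assumption.

```python
def predicted_from_rank(responses, totals,
                        order=("agree", "undecided", "disagree")):
    """Predict each person's response to one item from his rank on total score.

    If c1 persons chose the first category, the c1 highest scorers are
    predicted to have chosen it, the next c2 the second, and so on.
    """
    n = len(responses)
    counts = [sum(1 for r in responses if r == cat) for cat in order]
    # Persons from highest to lowest total score (ties keep input order).
    ranked = sorted(range(n), key=lambda i: totals[i], reverse=True)
    predicted = [None] * n
    pos = 0
    for cat, count in zip(order, counts):
        for i in ranked[pos:pos + count]:
            predicted[i] = cat
        pos += count
    return predicted


def item_reproducibility(responses, totals,
                         order=("agree", "undecided", "disagree")):
    """Proportion of responses to the item reproduced correctly from rank.
    A value of 1.0 means the item scales perfectly in this sample."""
    predicted = predicted_from_rank(responses, totals, order)
    errors = sum(1 for p, r in zip(predicted, responses) if p != r)
    return 1 - errors / len(responses)


if __name__ == "__main__":
    totals = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
    # 60% agree, 10% undecided, 30% disagree, in perfect order of total score.
    perfect = ["agree"] * 6 + ["undecided"] + ["disagree"] * 3
    print(item_reproducibility(perfect, totals))
    # The highest and lowest scorers exchange responses: two errors.
    flawed = ["disagree"] + ["agree"] * 5 + ["undecided"] + ["disagree"] * 2 + ["agree"]
    print(item_reproducibility(flawed, totals))
```

Items with a low index would be the ones to combine or sift out when the instrument is shortened.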


2. Generalized Attitude Scales * 


Remmers has pointed out that while Thurstone’s technique is
sound, it is also very laborious. In particular, it opens the way to
the construction of almost innumerable specific attitude scales.
He has proposed to retain the rigor of the method and avoid the
practical drawback of scaling enormous numbers of specific
attitudes by constructing general attitude scales. A very considerable
number of these, also, have been developed; and since once more
they are all similar in principle and technique, no one is selected
for discussion here. Generalized attitude scales constructed
by Remmers and his associates will be found listed in the
bibliography of tests at the end of the book under the heading Attitude.

A generalized attitude scale, according to the explanation given
by Remmers, consists of affective statements or stereotypes, all of
which apply validly to a psychological continuum representing
attitudes towards a body of objects, such as nations, races,
institutions, vocations, political parties, and the like. In plainer
language the idea is to develop a scale that will measure attitude,
not to the Negro race specifically, but to any race; not to this or
that school subject, but to school subjects in general, or rather to
any school subject. The purpose is, of course, to reduce the labor
of construction and administration and still have instruments that
can be widely used.

The pioneer scale of this type was for the measurement of
attitude towards “any school subject.” It contained such statements
as the following: “I look forward to this subject with horror”; “I
have seen no value in this subject”; “I don’t believe this subject
would do anyone any harm”; “This subject is all right”; “This
subject is a good subject”; “I really enjoy studying this subject”;
“No matter what happens, this subject comes first” (see Remmers
and Silance).

We now turn to consider three representative attitude tests
which are based on the conception of attitude itself, not as a
system of evaluations with reference to some specific phenomenon or
theory, but as an underlying and pervasive disposition in the
individual that shapes and colors his reactions to life.

* References: Remmers, 1934 b, 1934 c, 1936, 1938; Remmers and Silance.


3. A Study of Values * 


This test is divided into 2 parts. (1) is a set of items calling
for statements of preference as between two fields of activity. Each
item is to be answered yes or no, and the degree of preference is
to be indicated on a scale from 0 to 3 points. Such preferential
statements as that business should be operated for profit rather
than for service, or that scientific research should be for the
discovery of truth rather than for practical applications are set up.
(2) is a set of 4-choice items, each to be ranked in order of
preference; for instance, that the function of government is
primarily to relieve the needy, to develop business, to make politics
more ethical, or to achieve power. These items are scored on a
differential scheme, which yields a profile showing the importance
the subject attaches to 6 kinds of values; namely, theoretical,
economic, aesthetic, social, political, and religious.

Cantril and Allport (q.v.), in a study of this and similar tests,
report that the scores for social values are the least reliable and
discriminating. But they claim to show that the other items, when
scored in terms of the indicated norms, “select consistent,
pervasive, enduring, and above all generalized traits of personality.”
Their importance, it is claimed, lies in the fact that a person’s
behavior is not determined by the immediate situation, or by
transient interest, but by general evaluative attitudes to which the
term values is attached.

As E. B. Greene (q.v.) remarks, this latter contention may be
correct, but when the attempt is made to translate it into a
psychometric instrument based on a well-defined concept or concepts
which focalize a set of test items, very serious doubts arise. Can
such an instrument truly reveal these long-term, deep-seated,
orienting trends in the personality? In the present instance each
of the six value categories set up seems to be a collection of
superficially relevant items rather than a clear unitary trend.
Theoretical values include concern with the discovery of natural
laws, mathematical relationships, and scientific facts. Economic
values are embodied in items having to do with activity in real
estate, finance, industry, and vocational training, which certainly
seem a heterogeneous collection. Social values are embodied in
items having to do with a sense of responsibility to others and
their needs, with unselfishness and sympathy, some of which might
well overlap religious values. Political values are discriminated
by items having to do with government and political affairs, the
exploration of the world, and the acquisition of professional and
social prestige, the internal relevance of which is far from clear.
Religious values are supposed to be revealed by items dealing
with the abolition of war, laying up treasure in Heaven, reverence
in church, belief in God, and the evaluation of life as a whole.
It is clear that the score on each of the 6 values is obtained by
responses to a confused jumble of items, held together rather on
the analogy of material in a filing system than in terms of true
psychological coherence and unity of meaning. Nor are the
internal consistencies of the score-patterns for the various values
as reported by Allport and Vernon very reassuring.

This, as might be well expected, is the prevailing weakness of
instruments of this kind. Universal life values are presumably
very important, though just how consistently people are motivated
by them may be a question. But to uncover them by means of a
psychometric instrument would seem a difficult if not an impossible
task, because although we may know what they mean well enough
for general purposes, they are not defined with sufficient clarity
for mental measurement.

* Reference: Vernon and Allport.


4. Test of Public Opinion *


This is an interesting variant on what has already been
discussed, not so specific as the attitude scales, not so sweeping in
its claims as the values tests. It undertakes to reveal
prevailing attitudes on social and public matters. There are 6 subtests.
(1) Crossing out from 51 items every word that arouses more
disagreement and annoyance than agreement and attraction. (2)
Rating 53 statements on degrees of truth. (3) Paragraphs followed
by a number of short statements, to be checked as to whether or
not they are logical consequences of the paragraph. Prejudice is
supposed to be revealed by including some statements that do not
follow logically. (4) Approving or disapproving on moral grounds
various described situations. (5) Distinguishing between strong
and weak arguments. (6) Generalization test much like (2).
Scoring standards were worked out on the basis of jury ratings by a
group of judges. The scores purport to reveal prejudice or bias
along 12 lines, i.e., for or against radicalism, capitalism, economic
liberalism, social gospel, personal communion and mysticism,
fundamentalism, Christian modernism, religious radicalism,
protestantism, Roman Catholicism, puritanism, libertinism.

Quite a large number of ingenious tests of this order have been
produced (cf. Murphy and Likert), but their validity is open to
very grave doubt. They deserve attention from students of
psychometrics as examples of test construction rather than as
instruments for serious practical application.


5. Pressey Interest-Attitude Tests + 


This instrument can probably be treated as appropriately in
connection with the general category here under discussion as any
other, although to be sure it has some of the aspects of an interest
test. It is intended for ages from grade 6 to the adult level.
According to its authors, it provides “a simple and expedient way of
investigating the maturity of the interests and attitudes of a group
with respect to a large number of items.” It consists of 4 subtests
each of 90 items. (1) Things the subject considers wrong. (2)
Things that interest him. (3) Things he worries about. (4)
Characteristics of persons he admires. The response required is simple,
consisting of putting an X in front of words arranged in columns.


* Reference: Goodwin B. Watson. 
+ Reference: Pressey and Pressey, 
If he has very strong feelings he can put two X’s. The items, 360
in number, were very carefully chosen from a list of 950 on the
basis of their power to differentiate between younger and older
subjects. Norms for boys are reported based on 2,000 cases, and
for girls, on 2,088 cases. A split-half reliability of .94 to .96
has been reported for the whole test for single grade groups, and
of about .85 for the various subtests. Validity appears to be
reasonably satisfactory, for the test correlates .66 to .72 with
estimates of emotional maturity made by guidance workers.
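
The split-half coefficients just quoted may be made more concrete by
a short computation. The sketch below, in Python, is offered only as
an illustration and not as the procedure actually followed by the
authors of the test. It assumes an odd-even division of the items and
the Spearman-Brown step-up from the half test to the whole test,
neither of which is specified in the account above; the function
names and any marks supplied to them are hypothetical.

```python
from statistics import mean


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown(r_half):
    """Step a half-test correlation up to an estimate for the full test."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_marks):
    """Estimate reliability from per-subject item marks.

    item_marks holds one list of item marks per subject. The split into
    odd- and even-numbered items is an assumption made for illustration.
    """
    odd_half = [sum(row[0::2]) for row in item_marks]
    even_half = [sum(row[1::2]) for row in item_marks]
    return spearman_brown(pearson_r(odd_half, even_half))
```

Under these assumptions a correlation of .50 between the two halves
steps up to about .67 for the whole test, which shows why a whole-test
coefficient runs higher than the correlation between its halves.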

To sum up, then, specific attitude scales, and generalized attitude
scales developed as efficiency devices, have proved feasible and
satisfactory. The same cannot be said of tests designed to reveal
deep-seated and controlling values. The reason is clear. The more
definite the controlling concept, the better the instrument.


MEASURES OF MORAL CONDUCT AND CHARACTER


A very considerable number of tests and scales which attempt
to measure various aspects of moral conduct and character are
available. As good representative examples as can be found are
the C.E.I. (Character Education Inquiry) Tests.* Instead of
describing some or all of them in detail, some of the more ingenious
items and devices embodied will be presented here, so that the
reader may form some notion of what is done in work of this kind.
A full list appears in the bibliography of tests at the end of the
book. Much in the following brief account is based upon the
material presented by E. B. Greene (q.v.).

One of the sub-batteries of this extensive set of tests has to do
with Moral Knowledge and Opinion. Characteristic items employed
in the separate tests in this category are as follows. There
is a set of true-false statements which are intended to measure
cause and effect in the moral realm, as for instance, “God
punishes bad people by making them sick.” There is a set of multiple
choice recognition items, in which the subject is to indicate
whether certain described actions are classifiable as cheating, or
lying, or stealing, or as some other kind of offense, or not wrong
at all. There is a vocabulary test, in which the meaning of a list
of words with moral connotations is to be indicated. There is a
set of so-called “free response” items, in which a situation with

* References: Hartshorne and May; Hartshorne, May, and Maller;
Hartshorne, May, and Shuttleworth.




moral implications is described, and the subject is to write down 
whatever he thinks might happen, being scored on the number of 
important or probable consequences indicated. The same situa- 
tions are also used in a test which requires the subject to say 
whether certain stated consequences are probable, possible, or 
would not happen at all. 

Another sub-battery is made up of tests having to do with 
conduct, and specifically with the qualities of honesty, service, 
and inhibition as expressed in conduct. For the measurement of 
honesty, copying, self-scoring, and improbable achievement tests 
have been invented. The copying technique requires that subjects 
be seated in pairs, each member of a pair having what looks like 
the same objective test to work upon, the order of responses, 
however, being different in the two tests. The self-scoring technique
consists in running a test, having it handed in and scored
without the knowledge of the subjects, and then returned to them
for self-scoring. The improbable achievement technique consists
in running the same test twice, once under supervision so that it
would be impossible to cheat or raise the score, and once without
supervision. Quite a range of very clever “cheatable” tests have
been produced. As a test of the subject’s tendency to lie, he is
asked questions about cheating on the above, without his knowing
that it is possible to check up on his answers.
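
The logic of the improbable achievement technique can be set out as
a small computation. The following Python sketch is an illustration
only. The account above does not say how large a rise in score was
taken as improbable, so the threshold `max_honest_gain`, the function
name, and the subject labels are all assumptions introduced for the
example.

```python
def improbable_gains(supervised, unsupervised, max_honest_gain):
    """Flag subjects whose unsupervised score rose improbably.

    supervised and unsupervised map a subject label to that subject's
    score on the supervised and the unsupervised administration.
    max_honest_gain is an assumed allowance for practice and chance.
    Returns a dict of flagged subjects and the size of their gain.
    """
    flagged = {}
    for subject, honest_score in supervised.items():
        gain = unsupervised[subject] - honest_score
        if gain > max_honest_gain:
            flagged[subject] = gain
    return flagged


# Hypothetical scores for three subjects on the two administrations.
supervised_scores = {"A": 20, "B": 22, "C": 18}
unsupervised_scores = {"A": 21, "B": 35, "C": 17}
suspects = improbable_gains(supervised_scores, unsupervised_scores, 3)
```

Here only subject B is flagged, with a gain of 13 points. In practice
the allowance would have to be fixed from the gains observed when both
administrations are supervised, a step the sketch leaves out.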

Again, various methods have been invented for determining
cooperative tendencies. Choices are given between working for
the class or for oneself to win a cash prize that might be awarded
either to the individual or the class. Votes are taken as to the
allotment of prize money either to the subjects or to a sick child
in a hospital. Opportunities are presented for sharing possessions
by giving each subject a box containing 10 articles, with permission
to keep what he likes or to make donations for boxes to go
to poor children. Again, the child is given 4 envelopes and asked
to find or make up jokes, puzzles, pictures, etc., to be enclosed in
the envelopes as gifts to children in a hospital. A systematic record
of cooperative services during a period of time is set up. Verbal
descriptions of cooperative action in the form of short episodes
involving boys and girls are formulated, teachers being asked to
match each boy or girl under consideration with the approximately
corresponding verbal portrait. A check list of a large number of
traits is provided, with concrete characterizations of cooperative
acts to be applied to each child by the teacher.
For inhibition and persistence, the following devices have
been invented. An exciting story is read up to the climax, at which
point the ending is set up with capital letters entirely, and
with both small letters and capital letters and with words run
together, and with spaces wrongly placed between the capital
letters, the task being to separate the words with pencil
marks to facilitate reading. A page of pied type is presented, the
task being to count the letters. As a distraction test, lines of digits
to be added are printed among curious pictures and lines, the
score being the difference from adding time with normal
presentation. As an inhibition test, various temptations are applied,
such as presenting a piece of candy not to be eaten until a given
time, or a small safe also not to be opened until a given time. Also
a check list of words indicating types of self-control is provided
for ratings by teachers.
There are also various tests of moral opinion, including lists of
items to be checked if they indicate duties, described situations
in connection with which the subject must say what he would do,
and tests involving the relative importance of various moral
principles.
One is impressed with the ingenuity with which the item content
of these tests has been fabricated. Indeed, they are full of
devices, one might almost say of gadgets, which the psychometric
worker may well find interesting and even suggestive and
helpful. Tests of moral conduct and character seem very much
[...] trait of moral excellence. [...] what a subject says he
[...] situation and what he actually does in a [...]. [...]
reported negative correlations between the ability to say what is
the right thing to do and the actual tendency to do it when the
action concerned is undesirable, the coefficients ranging from
-.13 to -.44 for individuals and groups. However, he reports
positive correlations of .23 and .53 for verbal decisions and
actual behavior in the case of desirable actions.

Clearly we have here another instance of a tempting field
for psychometric research and endeavor, in which at the same
time little success has been achieved because of the lack of
manageable and definable concepts about which viable and
significant test items can be organized. For the general statement
may be made that tests of moral conduct and character are of
very doubtful validity.


SUGGESTED ADDITIONAL READINGS 


For additional reading and more intensive study of the material in 
this chapter, the most important sources are the tests themselves, and 
more particularly the manuals of the tests discussed. Publishers will be 
found listed in the bibliography of tests at the end of the book. Also 
references mentioned in the text in connection with the various tests 
may be consulted. Further readings and suggestions are as follows: 

Harold D. Carter, Vocational interests and job orientation (Applied
Psychology Monographs, No. 2; Stanford University, Calif.: Stanford
University Press, 1944). A comprehensive survey and evaluation of
ten years of work.

Edward K. Strong, Jr., Vocational interests of men and women
(Stanford University, Calif.: Stanford University Press, 1943).

Quinn McNemar, “Opinion-attitude methodology,” Psychological
bulletin, 43 (1946), 289-374. The most thorough and complete survey
and evaluation of the field available.

Albert Ellis, “The validity of personality questionnaires,”
Psychological bulletin, 43 (1946), [...]. [...] appraisal of the field.

Oscar Krisen Buros (editor), The 1938 mental measurements yearbook
(New Brunswick, N. J.: Rutgers University Press, 1938); and
The 1940 mental measurements yearbook (Highland Park, N. J.: The
Mental Measurements Yearbook, 1941). These valuable reference
works should be consulted for test reviews and bibliographies.

Gertrude H. Hildreth, A bibliography of mental tests and rating
scales (2nd ed.; New York: Psychological Corporation, 1939); and
A bibliography of mental tests and rating scales. 1945 supplement
(New York: Psychological Corporation, 1945). These very complete
[...] useful sources for any additional tests it [...].


QUESTIONS FOR DISCUSSION

1. Examine and compare the various working concepts used in test
construction in connection with personality measurement as
embodied in the instruments here described.

2. Which of the various tests here presented might justifiably be
classified under different headings from those under which they
appear in this chapter? How might they be reclassified? What would be
the advantage of so doing?

3. Examine the list of interests of bright and dull children in
Table 30. Might a test be set up on this basis that would discriminate
between the bright and the dull?

4. Critically examine the concept of the personality quotient, and
the procedure for computing it, in comparison with that used for
[...].

5. In a case of practical group or individual guidance, would you
expect an interest test to contribute anything not derivable from an
intelligence test? Would the opposite be true, i.e., a contribution from
the intelligence test not derivable from the interest test? In each case,
what, if anything, would such a contribution be?

6. Do you find any tests here listed, besides the Miner Analysis of
Work Interests, that might be very useful if employed in connection
with conference or discussion, but perhaps not of much use if
employed simply to obtain a score?

7. Why might it be impossible to measure by psychometric means
the dominance and pervasiveness of religious values, and yet possible
to measure rather definitely a person’s attitude towards God?

8. Consider carefully the relative excellence of the Bernreuter
Personality Inventory and the Minnesota Multiphasic Personality
Inventory. Give reasons for your evaluation. State any psychometric
principle [...].

9. [...] Vocational Interest Blanks (for men and for women) and
the Kuder Preference Record, from the standpoint of mode of
construction, psychometric principles, and [...].

10. [...] tests here discussed require self-ratings. What [...] of
serious falsification in the replies?

11. If you yourself or any one individual undertook to scale a
number of statements revealing the strength of a given attitude,
using the best possible judgment and common sense in the scale
placement of the statements, in what respects would the result be less
trustworthy than those obtained by Thurstone’s method?


CHAPTER IX 


APPLICATIONS OF MENTAL TESTS AND THEIR 
IMPLICATIONS FOR TESTING 


INTRODUCTION 


Having discussed in some detail the chief types of psychometric
instruments at present available, the next major consideration is
to evaluate what has so far been achieved in the testing movement
and the work along various lines which is being done looking
towards its further development. For this the necessary starting
point is a review of some of the major and representative
applications of mental tests and an interpretation of the most
important results that have been gained. Such is the purpose of the
present chapter.


The applications of mental testing have been, of course,
enormously widespread and various, and it is not possible in the
compass of this volume to summarize them all, even briefly. Accounts
of such applications along many lines will be found in many of
the references here contained, and excellent summaries are
presented by Pintner (1931), Anastasi (1937), and others. No
attempt will be made here to cover this ground. The purpose is not
to acquaint the reader with the factual results of testing for their
own sake, important though they undoubtedly are. Our concern
is rather with the psychological content and significance of mental
tests, as we now know them, as a necessary basis for their
intelligent use and possible improvement. If an understanding of these
things is to be achieved, it must be done in the light of the results
and findings that have come from the major applications of
psychometric methods and instrumentalities.


MENTALITY AND SOCIOECONOMIC FACTORS 


There is a wealth of evidence that mental traits in general,
and more particularly intelligence, are more or less closely related
to socioeconomic factors such as occupation, home conditions,
institutional conditions, community conditions, and the like. The
mentality which is measured in various ways by psychometric
instruments is not something that exists and functions in isolation
from the general circumstances of life, and it is important to
understand as clearly as possible what the relationship is, because
it deeply conditions the proper use of the instruments themselves
and the interpretation of their results.

Ever since the analysis and publication of the results of the
army testing in World War I, the relationship between mentality
and socioeconomic circumstances has been recognized. As
heretofore pointed out, it was then found that occupations can be ranked
in a rough hierarchy in respect to the median intelligence of those
engaged in them. The evidence was brought together and
summarized by Fryer (1922), and some samples of his findings are
presented in Table 20. The relationship between intelligence and
occupation, although regarded as significant, was not very definite.
There was found to be much overlapping, and although success
in a given calling might require a minimum intelligence there
was no clear-cut upper limit. This work has been widely
supplemented and amplified in detail, and various modifications and
uncertainties have appeared. It has in general been confirmed by
Harrell and Harrell, to whose study reference has been made, for
World War II testing. Also, there has been a great deal of debate
about the reason for the relationship. On the one hand, it might
be that a given low-grade occupation tended to select hereditary
intelligence. On the other hand, it might be that such an
occupation had [...] intelligence, or at least of preventing [...].
Later on we shall have to return to these considerations. But
whatever the cause of it, the fact of the relationship itself was
of obvious importance for guidance and for the practical use of
test results. Among all the evidence and findings that have
emerged, however, the most important for our present purpose is
that occupational differences in particular, and socioeconomic
differences in general, reflect themselves not only in the
intelligence of the adult workers in various callings, but also
in that of their children.



1. Child intelligence and parental occupation

There can be no doubt that there is a basic relationship
between the intelligence of children and the occupation of their
parents. This has been formulated by Terman and Merrill (1937 b)
in one of the best and clearest studies of the subject. Their
overall findings are summarized in Table 32. It will be noticed that
there is a difference of roughly 20 points in the mean intelligence
quotients of children whose parents are in the most favored and the
least favored occupational groups. At the same time, the
overlapping is very considerable. But in spite of this, the relationship
manifests itself as significant. Nor does it seem to vary much with
the chronological age of the children concerned. The data
presented by Terman and Merrill are based on the standardization
group of the Revised Stanford-Binet scale, which, the reader will
remember, was large, representative, and drawn from many parts
of the United States. Moreover, the group was homogeneous
with respect to race, consisting entirely of American-born white
children.


                            TABLE 32

 MEAN STANFORD-BINET I.Q.’S CLASSIFIED ACCORDING TO FATHER’S
                           OCCUPATION

               (Terman and Merrill, 1937 b, p. 48)

 FATHER’S OCCUPATIONAL                    CHRONOLOGICAL AGES
 CLASSIFICATION                       2-5½    6-9   10-14  15-18

   I. Professional ................   116    115    118    116
  II. Semiprofessional and
      managerial ..................   112    107    112    117
 III. Clerical, skilled trades,
      retail business .............   108    105    107    110
  IV. Rural owners ................    99     95     92     94
   V. Semiskilled, minor clerical,
      minor business ..............   104    105    103    107
  VI. Slightly skilled ............    95    100    101     96
 VII. Day laborers, urban and
      rural .......................    94     96     97     98
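
How a difference of roughly 20 points between group means can go
together with very considerable overlapping may be seen from a short
calculation. The Python sketch below is illustrative only. It assumes
that the quotients in each occupational group are normally
distributed with a common standard deviation of 16 points; that
figure, like the function names, is an assumption made for the
example and is not taken from Terman and Merrill's data.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Cumulative probability of the standard normal curve at z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def share_above_other_mean(mean_difference, standard_deviation):
    """Proportion of the lower group expected to exceed the higher mean.

    Both groups are assumed normal with the same standard deviation,
    their means differing by mean_difference points.
    """
    z = mean_difference / standard_deviation
    return 1.0 - normal_cdf(z)


# Assumed figures: 20-point difference in means, 16-point spread.
overlap = share_above_other_mean(20.0, 16.0)
```

On these assumptions roughly a tenth of the children in the least
favored group would still be expected to exceed the mean of the most
favored group, which is the sense in which the overlapping is
considerable despite the difference in means.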

Similar findings have been consistently reported for other
groups, and also when intelligence tests other than the
Stanford-Binet scale were used. Thus Goodenough (1928 c) tested 380
children, 190 being boys and 190 being girls, the age distribution
being 122 of 2 years old, 126 of 3 years old, and 132 of 4 years old.
She used the Kuhlmann Revision of the Binet scale, and obtained
comparable results, with occupational-intelligence differences even
more marked than those reported by Terman and Merrill, as will
be seen from Table 33. Coffey and Wellman (q.v.), again, ran
tests every 6 months from 1921 to 1934 on 417 young children at
the University of Iowa Child Welfare Research Station. Classifying
the children in the occupational groupings adopted by Goodenough,
they obtained similar results. The tests used were the
Kuhlmann-Binet for those under 3 years old, and the Stanford Revision of
the Binet scale for those over that age. Haggerty and Nash (q.v.),
again, using the Haggerty Intelligence Examination Delta 1 and
Delta 2 with a total population of over 8,000 children, found that
the relation manifested itself with those of their group who
were in grades 3 to 8, as will be seen from Table 34.
Occupational-intelligence differences, however, were by no means so marked for
children enrolled in high school, which is a significant and
interesting finding.


                            TABLE 33

 MEAN INTELLIGENCE QUOTIENTS CLASSIFIED BY OCCUPATIONAL
                   GROUPS: CHILDREN 2 TO 4

            (Adapted from Goodenough, 1928 c, p. 287)

 FATHER’S OCCUPATIONAL CLASSIFICATION         N     MEAN I.Q.

   I. Professional .....................    [...]     125.0
  II. Semiprofessional .................      29      119.7
 III. Clerical and skilled labor .......     129      113.4
      [...] ............................      79      108.0
   V. Semiskilled and minor clerical ...    [...]     [...]
  VI. Slightly skilled .................    [...]     [...]
      Unskilled ........................    [...]     [...]

Terman (192[...]), again, in connection with his monumental
studies [...], found that the proportion of highly gifted [...]
is greatest among the professional group, next in the public
service group, and lowest in the industrial group.

The same results have been reported from abroad. Duff and
Thompson (q.v.), in their survey of intelligence in the county of
Northumberland, England, tested, in all, 13,220 children in the
elementary schools and 405 children in the secondary schools.
Children of “brain workers” made the highest scores, the number
being 1,722, and the mean I.Q., 106.6. Children of “hand workers”
made the lowest scores, the number being 10,848, and the mean
I.Q., 98.6.

To sum up, then, the significant points are these. (a) Parental
occupation reflects itself in the intelligence scores of children.
(b) Urban and rural differences also reflect themselves in the
intelligence scores of children. Attention has not been called to
this, but the reader will notice that it is indicated in the
tabulations. (c) Both the above points seem to be true for children of
a considerable variety of ages. (d) The relationship manifests
itself when different intelligence tests are used, and when
somewhat varied occupational classifications are set up. (e) It is not
so marked for pupils in high school as for elementary school
children and preschool children, which is probably due to the
selective character of secondary education.


                            TABLE 34

 CHILDREN’S MEDIAN INTELLIGENCE QUOTIENTS CLASSIFIED BY
 OCCUPATIONAL GROUPS: ELEMENTARY SCHOOL AND HIGH SCHOOL

          (Adapted from Haggerty and Nash, pp. 569-70)

                                 CHILDREN,            HIGH SCHOOL
 FATHER’S OCCUPATIONAL       GRADES III TO VIII         PUPILS
 CLASSIFICATION               N    Median I.Q.      N    Median I.Q.

   I. Professional ........   349      116         201     [...]
  II. Business and clerical   944      107         374     [...]
 III. Skilled labor .......  1028       98          54     [...]
  IV. Semiskilled labor ...  [...]    [...]        267      108
   V. Farmer ..............  3098       91          48      108
  VI. Unskilled labor .....   145       89         489      106


2. Child intelligence and community setting 


Once again, broad differences in community setting reflect
themselves in the intelligence test performance of children. The
relationship is clearly and typically shown in Table 35. The two
communities where the tests were run differed markedly in [...]
and general advantages. The Kansas town had a growing [...],
improving library facilities, good recreational opportunities,
and so forth. The Ohio community was exactly the opposite
in all chief ratable respects. The extent to which these
differences showed up in the test performance of the children is
strikingly revealed in the tabulation. (For primary sources of this
material see Pintner, 1917; Paterson, 1918; Pintner, 1931.) Similar
differences are found to appear as between the children of
different cities which are definitely separable in socioeconomic and
cultural status (Pressey, 1919). Very great differences have been
found also in the intelligence test performance of children in
different schools in the same city. Thus Dickson and Norton (q.v.)
have reported that [...] grade intelligence in 29 schools [...]
points in terms of test scores, with the median at 81, and that
intelligence averages are closely associated with the socioeconomic
status of the immediate neighborhood of the school. Maller (q.v.),
too, studied all 5th grade children in 273 health centers in New
York, the total number of subjects being 100,153. He [...]
corresponding differences in performance on the National
Intelligence [...] Rapid Survey Test, the mean I.Q. of the lowest
rating area being 74, and that of the highest, 118. Here again
levels of intelligence test performance closely parallel
socioeconomic status and circumstances. Finally, Thorndike and
Woodyard (1942), studying the National Intelligence Test scores of
6th grade children in 30 cities, find the usual wide range and
overlapping. Also they report high correlations of test scores with
“goodness of life,” “personal qualities of residents,” and “per
capita income.”


                            TABLE 35

 INTELLIGENCE RATINGS OF CHILDREN IN TWO COMMUNITIES, AFTER
               PINTNER (1917), PATERSON (1918)

              (Quoted from Pintner, 1931, p. 246)

                                   Kansas          Ohio
 Mental Ratings                 percentages     percentages

 Very Bright .................     [...]           [...]
 Bright ......................     [...]           [...]
 Normal ......................     66.0            65.6
 [...] .......................     [...]           [...]
 Dull ........................     [...]           [...]

 Number of cases .............      332             154

Sherman and Key, and Sherman and Henry (q.v.), in one of
the most important and interesting investigations of this problem,
made a study of the residents including the children of four
“hollows,” i.e., isolated and roadless communities in the Southern
mountains, and compared them with the residents of a small
village where there were far more amenities, conveniences,
outside contacts, and general civilizing influences. In testing they
used the Stanford Revision of the Binet scale, the National
Intelligence Tests, the Pintner-Cunningham Primary Scale, various
performance tests, and the Goodenough Drawing Scale. As may
be gathered from Table 36, which brings together some of their
results, the village children were [...] norms, but the test
performance of the “hollow” children was lower. Moreover, the
mean intelligence of the latter varied in relationship to the
isolation and lack of privilege of the “hollow” where they lived.
This work is of peculiar significance because of the [...] of tests
used, which included performance tests. Similar results were
obtained by Gordon (q.v.) in his frequently cited [...] of the
underprivileged, isolated, and often illiterate children of canal
boat families in England, whose mean Stanford-Binet I.Q. was
69.6. Hirsch (1928) has in general confirmed the findings of
Sherman and Key in his own study of Kentucky mountain children,
though his work was less well controlled. And [...] and Thomas
(q.v.) showed that children in a “good” [...] district were
definitely superior in intelligence test performance to those in a
“poor” one.


                            TABLE 36

 INTELLIGENCE RATINGS OF “HOLLOW FOLK” AND VILLAGE CHILDREN

                   (Sherman and Key, p. 283)

                                  MOUNTAIN
                                 (“HOLLOW”)           VILLAGE
 Test                            N   Av. I.Q.       N    Av. I.Q.

 Stanford-Binet ..............   32   [...]       [...]   [...]
 National Intelligence Test ..   24   [...]       [...]   96.4
 Pintner-Cunningham ..........   42   [...]       [...]   87.6
 Performance Tests ...........   54   [...]       [...]   [...]
 Goodenough ..................   63   [...]       [...]   [...]

At the same time we must be on our guard against uncritically
accepting community ratings as safe socioeconomic indices.
[...]ng’s results (q.v.) revealed no inferiority in the test
performance of rural children when only those of native American
stock equal in occupational classification and in educational
opportunity to urban children were considered. She remarks that it
is not necessarily the rural environment as such that is involved,
but [...] plus various frequently found concomitants; and [...]
that other things being made equal, the country is [...] as
beneficent an environment as the city in which to raise children.
This is a salutary caution, for without prejudging the nature of
the relationship or considering whether it is one of cause and
effect, the bearing of the socioeconomic setting upon
intelligence test performance is by no means simple, and indeed
requires a good deal more analysis than it has so far received.


3. The question of causation


The established fact is that, although there is much overlapping,
intelligence test scores maintain a definite relationship with
socioeconomic status. There have been numerous attempts at
explanation, none of them wholly convincing.

A. One hypothesis has been that the relationship may be due
to selection. The less able members of an underprivileged
community may tend to remain in it while the abler tend to migrate.
Less capable persons may gravitate towards the less desirable
callings. Various investigators, among them Hirsch (1928) and
Pressey and Ralston, have made this assumption. But there is
no satisfactory proof of its truth. Duff and Thompson, in their
study of the distribution of intelligence in Northumberland, found
that average test performance was higher for rural children far
from cities than for those living near them and thought this might
be due to the latter having greater opportunities to migrate so
that selection would be more pronounced. But their work was
done in the rather special setting of rural northern England, and
the finding has not been confirmed elsewhere as a general
phenomenon. All one can say is that selective migration of the abler
members of underprivileged communities and the selection of low
intelligence jobs by underprivileged persons is not impossible, the
second appearing particularly plausible. But there is no
substantial evidence either way.

B. The type of test used in many of the investigations is often
thought to favor the abler and more privileged individuals, and
particularly the urban children. Undoubtedly this is a factor to
be considered. Shimberg (q.v.), for instance, constructed an
information test with two comparable forms. One of these two forms
she scaled on an urban standardization group and the other on a
rural standardization group, and when they were given to new
subjects, the urban children exceeded the scores of the rural
children on the urban form, and vice versa. It should be pointed
out, however, that this was an information test, and so far as the
present writer knows, there have been no investigations to
ascertain whether the information items often found in general
intelligence tests prevailingly favor different socioeconomic groups.

In particular, it has often been claimed that verbal tests favor
the more privileged groups, at any rate to a much greater extent
than performance tests. That there is a differential effect here
seems not improbable, though its extent and still more its
interpretation are both open to question. Baldwin and Fillmore (q.v.)
studied all the children up to the age of 16 years in four rural
communities and an Iowa city of [...],000, using both verbal and
nonverbal tests. They found that while the verbal intelligence
tests that were given showed a striking language weakness among
the rural children, there was no such nonverbal weakness
demonstrated. Jones, Conrad, and Blanchard (q.v.), in a very careful
study of the problem, used the Stanford-Binet scale and various
other tests including performance tests, with populations of
superior and average children in urban and rural environments
running to many hundreds. They found that many of the subtests
were wrongly placed in relative difficulty for the rural groups,
that the highly verbalistic content of the scale in its upper [...]
operates as a handicap with rural children, and that differences
in test performance between different socioeconomic groups [...]
a function of the mental processes being tested. While
city children exceeded country children on verbal items and
subtests, the country children were definitely superior to urban
children on such subtests as the Mare and Foal and on most of the
form board tests.
It is not true, however, that when performance tests are sub-
stituted for verbal tests, socioeconomic differences in test
performance disappear. They are usually reduced, and an instance of
this may be seen in the tabulation of the results of Sherman and
Key in Table 36. Even this, however, has not always been found
to be the case. Thus L. W. Pressey (q.v.) tested 357 children, from
6 years of age upward, in grades 1 to 3, using the Pressey Primary
Scale, which is a nonverbal test. The belief was that the use of
this test would tend to minimize the socioeconomic differentiation
of the rural children, yet only 22% of them reached the urban age
median. Thus the facts of the situation are by no means
unequivocally established.
But even if performance test scores do show a less marked
relationship to socioeconomic differences than do the scores of
verbal intelligence tests, the question still remains as to which
type of instrument is the more valid and important indicator of
mental status. We have already seen that the two types of tests
do not have very high correlations with one another. Thus it
would be most improbable that both of them would show the same
pattern of relationship with a third factor. Also, it may be said
that there is little doubt which of these two types of test is on
the whole better constructed, or which yields scores of greater
significance. The point we are discussing is often expressed by
saying that verbal tests are “unfair” to children in under-
privileged groups. But it may be that the differences they reveal
correspond to the facts.
C. A very striking and suggestive finding is that the differen-
tiation associated with an unfavorable socioeconomic setting is
often cumulative; that is, it becomes more marked the longer the
persons concerned remain in it. Thus Baldwin, Fillmore, and
Hadley (q.v.) report that infants in an underprivileged rural set-
ting are not below the medians for infants in privileged urban
settings, but that differentiation increases with age. Sherman and
Key have shown that intelligence ratings tend to drop, and
deviations below test norms to increase, as children in
underprivileged circumstances grow older, this being particularly
true of children in the remoter “hollows” studied. H. Gordon (q.v.)
also finds a striking drop in the intelligence ratings of canal-boat
children as they grow older. In many families the youngest children
would group around the I.Q. levels of 90 to 100, while the oldest
children would test almost feeble-minded. Honzik (1940), again,
demonstrates that as age increases, the intelligence test
performance of children shows an increasing relationship with such
factors as mother’s intelligence, parental education, and the
general socioeconomic status of the home. At the age of 21 months
her data show almost no relationship, but at 8 years the
correlations range from .33 to .55. Crissy (q.v.), too, found that
young children studied by her lost an average of about 10 points
I.Q. in 18 months in an orphanage environment. Honzik (1940) also
shows that correlations between child intelligence and maternal
intelligence show a striking rise from 21 months to 8 years, being
negligible at the former age and definitely significant at the
latter. The last three studies go beyond the socioeconomic factors
we are now immediately considering, but they contribute to our
awareness that the relationship between intelligence test
performance and the type of setting may grow closer over a period
of time.

Such cumulative effects, it is true, do not always appear. Thus
Hirsch (1928) has reported a correlation of -.23 between I.Q.
and C.A. for the underprivileged mountain children he studied.
This means a tendency for test performance to fall with length
of time in the environment, but he dismisses the coefficient as too
small to be significant, which, however, is a matter of opinion, for
it admittedly has been found. Other studies, too, have failed to
show a cumulative relationship, but it has been found in many
cases, whatever the reason may be.

In summary, two points emerge. (a) No clear-cut causal
explanation of the repeatedly observed relationship between
socioeconomic circumstances and intelligence test scores is forth-
coming. The problem will be reopened upon a wider basis in con-
nection with a later discussion of hereditary and environmental
influences. But the evidence up to this point manifestly disallows
partisan attempts to attribute everything either to the one or
the other. Moreover, the hereditarian-environmentalist controversy
by no means possesses the all-obscuring importance sometimes
attributed to it. (b) The positive and very important truth that
unmistakably emerges and will be further confirmed is that a
person’s mental test performance is a concomitant of the total
circumstances of his life. It is related to the conditions created
by the occupation of his parents, if he is a child in the home.
Incidentally we may note here a highly significant point. The
relationship is with the general type and circumstances of that
occupation, and not directly with its financial rewards. Also, intel-
ligence test performance is related to the conditions created by
the general community setting in which the individual lives. Fur-
thermore, it would appear that the longer a given set of circum-
stances, favorable or unfavorable, prevails, the closer the relation-
ship is likely to become. This is very far indeed from detracting
from the value of mental tests, for when it is found that whatever
they may measure interacts widely with the sum-total of the
subject’s living, a wealth of psychological content and significance
is indicated. It is just what, on general grounds, one would think
ought to be found if psychometric instruments are more than
trivialities. What is clearly inadmissible (and it would be disas-
trous to the significance of the tests themselves) is to think that
they reveal some factor in the human mental make-up quite
unrelated and unresponsive to anything else.


FAMILY RELATIONSHIPS AND MENTALITY


The question of the relation of mentality as indicated by test
performance to the home and family has been widely studied.
Pintner (1931), summarizing the evidence up to about 1929, finds
a hierarchy of correlations which is shown in Table 37. It shows
the resemblance in test performance for various degrees of blood
relationship, and it shows very clearly that resemblances in men-
tality steadily decrease as the relationship between the individuals
becomes more remote. Of course, it must be remembered that the
actual coefficients obtained in the large number of studies here
summarized vary quite widely about the rough averages shown in
the table. Some of the most important reasons for such variations
are the use of different tests, the application of them to different
groups, and the lack of uniformity in the general conditions under
which the investigations were carried on. But on the whole it would
seem that the tabulated coefficients, which reveal a decidedly
impressive hierarchy of descending resemblance, well represent the
established facts.
TABLE 37

AVERAGE CORRELATIONS IN INTELLIGENCE OF PERSONS IN VARIOUS
DEGREES OF RELATIONSHIP

(Quoted in part from Pintner, 1931, p. 512)

                                        Average
Relationship                          correlation
Identical twins ....................      .90
All twins ..........................      .75
Fraternal twins ....................      .70
Siblings ...........................      .50
[label illegible] ..................      .20
[label illegible] ..................      .00
[label illegible] ..................      .50

Since the date of this summary, however, much work has been
done in attempting further to analyze and better to understand
the problem. In particular, the effect of a foster home environ-
ment upon the mentality of the child, and also the results obtained
when identical twins are raised either together or apart, have been
quite fully investigated.


1. The effect of foster home environment 


Table 38 brings together some of the chief findings of the earlier
studies of the effect of foster home environment upon mentality.
Attention should be called to several important points in this
body of data.

A. It will be noted that the study by Freeman, Holzinger, and
Mitchell in particular indicates a very definite relationship be-
tween the intelligence level of the child, on the one hand, and the
various characteristics of the home on the other. In fact, the
correlations between the I.Q.’s of foster children and the character-
istics of the foster homes are in many cases comparable to, and
rarely much smaller than, the corresponding correlations of Burks
and Leahy between the I.Q.’s of own children and the characteris-
tics of their own homes. The correlations reported by Freeman,
Holzinger, and Mitchell are consistently higher than those ob-
tained by Burks (q.v.) and Leahy (1935). One explanation that
has been suggested is that the foster children studied by Freeman
had been placed to some extent selectively; i.e., there was a
tendency for children of superior hereditary endowment to be
adopted by the better families, and vice versa. There is some
reason to believe that this may have affected his findings, for the
foster children whom he investigated were adopted at a mean age
of 4 years and 2 months, whereas those of Burks were placed
before the age of 13 months, and those of Leahy before 4 months.
Leahy, in particular, took special pains to avoid the disturbing in-
fluence of selective placement. And this in any case would be more
apt to happen with older children whose characteristics and capac-
ities had begun to become apparent.

B. Freeman has reported an increase in the correlations of
child mentality with foster home characteristics with length of
residence in the foster home. In 74 cases the children, placed at
an average age of 8 years, were tested shortly before placement,
and the mean correlation with home characteristics was .34. When
these children were tested again 4 years later, their mental test
performance now correlated .52 with the characteristics of the
foster home.

. . . was, however, no increase in I.Q. with those placed in the
poorer type of foster home. In the case of children placed in foster
homes after the age of 12 years, there was no gain in I.Q. with
residence in the new home.


D. Where siblings were placed in different foster homes and
tested after a fairly long period of residence, Freeman finds a
correlation of only .25 between their mental test scores. This, of
course, is much lower than the coefficient of .50 which is the
average expectancy of sibling resemblance in mental test perform-
ance as shown in Table 37. Not too much weight can be put on
this finding, since the numbers involved were not large. But it is
indicative and opens up certain fairly obvious lines of speculation.

E. Turning to the study by Burks, it will be seen that she
obtained consistently low correlations between the mentality of
foster children and the ratings assigned to the foster homes on the
various characteristics indicated. However, these low correlations
must be considered together with other results that she reports.
Thus she finds virtually zero relationship between the
socioeconomic status of the true parents and the intelligence of
their children who had been adopted and had been living in foster
homes for 2 years. This, of course, would seem to undermine the
hereditarian conclusions suggested by the lack of effect she reports
for the foster home environment. According to her, foster child
mentality shows no relationship either to the characteristics of the
foster home environment or to the socioeconomic characteristics
of the child’s own parents. Burks concludes, on the basis of some-
what dubious reasoning, that the best foster home may contribute
as much as 20 additional points to a child’s intelligence quotient
and that the worst foster home may lower it as much as 20 points.
She seems to consider an influence which can do no more than
this as meager and unimportant. But other commentators have
remarked that if such an influence can produce a total of 40 points
variation in I.Q., it is very considerable indeed; enough, for
instance, to make the difference between classification as feeble-
minded or on the upper border line of normal intelligence.
F. The consistently and strikingly low correlations between
foster child I.Q. and foster home ratings which have been reported
by Leahy (1935) may be explained at least in part by the interest-
ing data presented in Table 39. In this table Neff (q.v.) has
brought together some of the original data reported by Leahy but
not tabulated by her.

TABLE 39

I.Q.’S OF CHILDREN CLASSIFIED BY OCCUPATIONAL STATUS OF
FOSTER PARENTS

(Leahy’s data tabulated by Neff, Table 5, p. 744)

Occupation                     N      Mean    [heading illegible]
I. Professional ..........     43     113     12
II. Semiprofessional .....     38     112     11
III. Skilled trades ......     44     111     [illegible]
IV. Rural owners .........     -      -       -
V. Semiskilled ...........     45     109     [illegible]
VI. Unskilled ............     -      -       -

The striking thing about it is the remarkably low and slight
differences in the mean I.Q.’s of children of parents of different
occupational groups. Whereas in Tables 32 and 33 the differences
in mean I.Q. between the children of parents of the highest and
lowest occupational groups range from 20 to 30 points, here the
spread is only 4 points. This indicates two possibilities. (a) The
range of socioeconomic differences in the groups studied by Leahy
may well have been unusually small. In particular, the true status
of her semiskilled group may have been in fact higher and more
closely similar to that of her professional, semiprofessional and
skilled trades groups than the words themselves would suggest, or
than is ordinarily found. If this were the case, it would have the
effect of reducing the correlations based upon these groups for
purely statistical reasons. (b) Another possibility that has been
suggested is that foster children perhaps receive unusual care and
attention, so that once again the real and effective status of the
homes of the semiskilled group with reference to influences bearing
on the children was higher than might appear.
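
The statistical point made under (a) can be illustrated with a brief simulation. The following Python sketch is offered only as an illustration under stated assumptions: the data are artificial, the population correlation of .50 and the names `pearson` and `simulate` are hypothetical choices made for the example, and nothing in it is drawn from Leahy's figures. It suggests how a correlation computed within a narrowed range of home ratings tends to fall below the correlation found across the full range.

```python
import random


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(n=20000, true_r=0.5, seed=0):
    """Artificial (home rating, child score) pairs; true_r is an assumed value."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        home = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        # Mix the home rating with independent noise so that the
        # population correlation between home and child equals true_r.
        child = true_r * home + (1 - true_r ** 2) ** 0.5 * noise
        pairs.append((home, child))
    return pairs


pairs = simulate()
full_r = pearson([h for h, _ in pairs], [c for _, c in pairs])

# Keep only homes in the upper half of the rating scale (a narrowed range).
restricted = [(h, c) for h, c in pairs if h > 0]
restricted_r = pearson([h for h, _ in restricted], [c for _, c in restricted])

print(f"full range r = {full_r:.2f}, restricted range r = {restricted_r:.2f}")
```

The narrowing used here (dropping the lower half of the homes) is arbitrary; any reduction in the spread of home ratings would be expected to act in the same direction, though to a different degree.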

The studies by Freeman, Burks, and Leahy, which have just
been considered, used a similar methodology and more or less
supplemented one another. Thus it seemed well to deal with them
together, and, since they have attracted much attention, to examine
and analyze them with some thoroughness. We now turn to the
work that has been done since the time of their appearance on
the relationship of foster home characteristics to mentality.

G. Skeels (q.v.) studied 147 foster children placed at an
average age of 2.7 months, and none of them placed at over 6
months. The Kuhlmann-Binet was used for those less than 3.5
years old, and the Stanford-Binet for those over that age. All these
children had true parents whose socioeconomic status was low.
On the basis of previous reports of the relationship of socioeco-
nomic status to intelligence Skeels estimated that the mean I.Q.
of this group of children should be about 98. As a matter of fact,
it was 108. Much more striking was the finding that the I.Q.’s of
these children had virtually no relationship to the socioeconomic
status and intelligence of their true parents.

H. Skodak (q.v.) continued this work by studying 154 children
who were adopted very young. She found that the good foster
homes had a definitely beneficial effect in terms of intelligence
test performance. The foster homes were rated on a rather elabo-
rate home inventory scale that emphasized many cultural and
stimulating factors but disregarded economic status, which is
probably quite misleading. On this inventory the homes were
rated in deciles, i.e., in ten-point steps of the scale scores
which indicated their general excellence. For each such ten-point
step in home status she found a reliable increase in the intelli-
gence of the foster children there residing. On the face of it, this
has the appearance of what almost amounts to definite proof of
the effect of good home conditions in improving mentality. But
extreme and hasty conclusions should be avoided, because the
home inventory scale itself was somewhat subjective, and the
persons who used it to rate the homes already knew the I.Q.’s of
the children, which may have influenced their judgment. Also,
correlations that were worked out on the Skodak data by Good-
enough (1940) and tabulated show little relationship between
the criterion of the foster father’s education and the I.Q. of the
foster child. It does not seem, however, that these reservations by
any means completely undercut the conclusions of Skodak.
I. Speer (1940 b) showed that the correlation of foster child
intelligence to the intelligence of foster mothers is directly related
to the length of time the child has stayed in his own home before
adoption. That is to say, the longer the child is retained in his
own home, the less his mental resemblance to his foster mother,
and the earlier he is adopted the greater the resemblance.

J. Harms (q.v.) studied a group of adopted children whose true
parents were of low mentality and inferior socioeconomic status.
When these children were tested after 5 years of residence in
foster homes superior to their own, their mentality showed a strik-
ing rise above expectancy. Thus the mean I.Q. of 87 children
whose true mothers had a mean I.Q. of about 63 was 106.
These results are of course very impressive, and their publica-
tion has aroused something of a furore of discussion. But one must
not rush to extreme explanations, or suppose that the environ-
mentalist case has been proved beyond a reasonable doubt. For
one thing it is necessary to recall the unreliability of much testing,
particularly when applied to adults. This means that the reported
intelligence levels of the true and foster parents studied are open
to a good deal of question. Indeed, the testing of these adults is
not above criticism, even in the best of the studies. For this pur-
pose Leahy, for example, used the Otis Self-Administering Test
of Mental Ability which is none too satisfactory for adults. Also,
the various indices and methods used to rate the homes studied
are far from perfectly reliable and adequate.

Still, it does seem reasonably well established that adopted chil-
dren in good homes do in fact achieve mental test performances
decidedly in advance of expectation, that prolonged stay in such
homes may have a beneficial effect, and that the resemblance of
these children to their own parents in respect of mentality is much
less close than might be anticipated. One should not argue out of
hand that biological heredity has no effect on intelligence, but
only that under certain circumstances, which are not fully under-
stood, but which probably include exposure to the environmental
influence at an early age, a good home has a very appreciable
influence.


2. Twin resemblance 


All investigations have shown a high resemblance in mentality 
between twins, and a very high one between identical twins. How- 
ever, the crucial issue that has been raised in recent investigations 
is what happens to this resemblance when twins are raised apart. 

Hirsch (1930 b) dealt with this problem some years ago and
was able to find only a very slight effect. The resemblance be-
tween his twin pairs was not significantly reduced when they were
raised in separate environments. However, he was able to deal
only with 4 pairs of twins separated rather late in life.

By far the most important and decisive study on the topic is
that by Newman, Freeman, and Holzinger (q.v.). They
investigated a very considerable number of twins, including 58
pairs of identical twins raised together, 50 pairs of fraternal twins
raised together, and 19 pairs of identical twins raised apart and
separated early in life. This latter is the crucial portion of the
investigation. The investigators used a large number of tests,
including the Stanford-Binet scale, the Otis Self-Administering
Tests of Mental Ability, the American Council on Education
Psychological Examination for High School Students, the Inter-
national Intelligence Test, and the Stanford Achievement Test.
The environment in which these twins were situated was rated by
five judges on a number of carefully formulated criteria, and the
mean of the ratings was assigned as the environmental score
value. The first question was to what extent differences in the
environments of the identical twins raised apart reflect themselves
in differences in mentality. Putting the matter otherwise, would
there be a tendency for more and more marked environmental
differences to be reflected in more and more marked differences
in mentality in these very closely related children? The answer
is contained in Table 40. Height and weight were virtually not
affected at all by differing educational and social environ-
ments. Weight was affected diversely by differing health envi-
ronments. Intelligence test performance was differentiated to a
marked extent by differing educational environments, and to a
considerable though slightly smaller degree by differing social
environments. Educational achievement was very greatly differ-
entiated by differing educational environments, and only quite
slightly by differing social environments.

TABLE 40

CORRELATIONS BETWEEN ENVIRONMENTAL DIFFERENCES AND TRAIT
DIFFERENCES FOR IDENTICAL TWINS REARED APART

(Newman, Freeman, and Holzinger, from Table 93, p. 340)

                               ENVIRONMENTAL DIFFERENCE RATINGS
TRAITS                         Educational   Social        Health
Height ....................    [illegible]   [illegible]   [illegible]
Weight ....................    [illegible]   [illegible]   [illegible]
Binet I.Q. ................    [illegible]   [illegible]   [illegible]
Otis I.Q. .................    .55           .53           .23
International Test ........    .46           [illegible]   [illegible]
American Council Test .....    [illegible]   [illegible]   [illegible]
Stanford Achievement ......    [illegible]   [illegible]   [illegible]

Correlations between three groups (identical twins raised to-
gether, fraternal twins raised together, and identical twins raised
apart) are shown in Table 41. For all the indices of mentality
the drop in the correlations for the third group in contrast to the
first is very striking. Identical twins raised apart, according to
the data of this study, resemble one another distinctly less than
fraternal twins raised together, and hardly more than siblings
raised together, as a comparison between Table 37 and Table 41
will show. More than this, the investigators point out that the
data do not tell the whole story. For the separate environments
in which their identical twin pairs were raised apart from one
another were not extremely dissimilar. If one member of each
pair had been raised under slum conditions, and the other in the
best and most stimulating home and cultural setting that could
be devised, the resulting differences would probably have been
much greater.


TABLE 41

CORRELATIONS ON VARIOUS TRAITS OF THREE GROUPS OF TWIN PAIRS

(Newman, Freeman, and Holzinger, Table 96, p. 347)

                                Identical,   Fraternal,   Identical,
Traits                          raised       raised       raised
                                together     together     apart
Standing height ............    .981         .934         [illegible]
Sitting height .............    .965         .901         [illegible]
Weight .....................    .973         .900         .88
Head length ................    .910         .691         [illegible]
Head width .................    .908         .654         [illegible]
Binet M.A. .................    .922         .831         [illegible]
Binet I.Q. .................    .910         .640         [illegible]
Otis I.Q. ..................    .922         .621         [illegible]
Stanford Achievement Test ..    .955         .833         [illegible]
Woodworth-Matthews .........    .562         .371         [illegible]
Still, here again it remains necessary to exercise caution in
drawing general conclusions. Even so thorough and careful a study
as that of Newman, Freeman, and Holzinger cannot safely be used
as a basis for interpretations going far beyond its range. One must
always keep in mind the unreliability and ambiguity of even the
best intelligence tests, and the greater unreliability and ambiguity
of the most careful ratings of environment. Also, there is always
the question whether a result obtained with one group of subjects
will be duplicated with different ones differently situated. As
Carter (q.v.) and Goodenough (1940 a) very properly point out,
the issue is not the support of some sweeping partisan position,
but it is first to ascertain what seems to be the effect of environ-
mental change on some particular group, and second, just what
environmental factors are prepotent.


SCHOOLING AND MENTALITY

1. Outstanding facts that have been established

The very large number of investigations in which mental tests
have been administered to school groups have yielded certain out-
standing facts which may be regarded as thoroughly established.

A. Schooling selects intelligence. That is to say, the less intelli-
gent tend to drop out. Mean intelligence scores tend to rise grade
by grade, particularly in high school and college. Today this
tendency is less marked in high school than it used to be, but it
still manifests itself in the upper high school levels and in college.
Many colleges enroll few students whose intelligence is not above
the population mean.

B. There is a marked positive relationship between intelligence
test scores and school achievement. This centers approximately
around a correlation of .50 or perhaps somewhat lower between
intelligence and average grade. The relationship varies consider-
ably, however, and is lower for some levels of schoolwork and for
some kinds of schoolwork than for others. It is much less deter-
minate and also decidedly lower on the whole for achievement in
individual school subjects.

C. The selectivity of different institutions differs very greatly.
The ablest student in one college may be inferior to the least able
in another. Thus what is called a “good college risk” will depend
upon the institution concerned.


D. Many other factors besides mentality and mental ability
determine both intention to continue with an education and actual
continuation. Economic status is undoubtedly a major influence.
This was shown many years ago by Book (q.v.) and Counts
(q.v.), and recently by Davis (q.v.) and by Karpinos and Somers
(q.v.). Thus it has been said that anyone who has the price and
is willing to spend the money can get a college degree in the
United States, irrespective of his mental capacity.

But while these findings are highly enlightening, and of major
importance in many ways, the prime question which has recently
come to the fore is whether schooling actually affects intelligence
as well as tending to select it. To this we now turn.

2. The effect of preschool attendance

The effect of preschool attendance upon mentality has become
a very prominent issue in recent years. Clearly it suggests many
momentous implications. The evidence, so far as it is available,
which is not at all completely, is summarized in Table 42. The
investigators there cited report that some gains in mean I.Q.
occur during preschool attendance. Sometimes they are small
enough to be considered negligible or doubtful. But sometimes, on
the surface at least and in advance of analysis, they appear quite
marked, and above all some of them are cumulative, increasing
with the length of stay in preschool.

TABLE 42

GAINS IN I.Q. OVER VARIOUS PERIODS OF PRESCHOOL ATTENDANCE

                               ONE YEAR       TWO YEARS      THREE YEARS
INVESTIGATOR                   N      Gain    N      Gain    N      Gain
Anderson ..................    26     2.6
Bird ......................    54     1.8
Frandsen and Barlow .......    30     3.3     29     14.2    58     [illegible]
Goodenough ................    84     4.6     51     6.2     13     4
Starkweather and Roberts ..    703    5.5     [remaining entries illegible]
Wellman ...................    652    6.6     228    10.4    67     [illegible]


The most important additional work on the problem is sum-
marized as follows.

A. McHugh (1940) reports an average gain of 6.07 points in
I.Q. during an average preschool attendance period of 1.9 months
for 91 children. He points out that during this short time virtually
the same gain took place as that reported by Wellman for a whole
year. His conclusion is that in each case the gain is not due to
any general influence of the preschool environment but to a
specific practice effect from the first testing, and can therefore
be disregarded.

B. Kawin and Hoefer (q.v.) report an average gain of 18.3
points in mental age during a full session of somewhat more than
30 weeks in preschool for 22 cases. They found, however, that
their control group, which was carefully paired with the experi-
mental group, and which did not attend preschool, made about
the same gain. They therefore question the effect of the preschool
environment.

C. Olson and Hughes (q.v.) report consistent gains in I.Q.
during a nursery school attendance period of 2 years, in com-
parison with a control group of children who did not attend
nursery school. But they attribute this gain not to the influence
of the nursery school itself, but to more general socioeconomic
influences, for when socioeconomic factors were equated, the ad-
vantage of the nursery school group disappeared.

D. Pegram (q.v.) reports that from 400 to 599 days of pre- 
school attendance gave 40 children an advantage of 4.9 points in 
I.Q. over a group of 40 matched control children who did not 
attend. 

E. Skeels and Dye (q.v.) report a mean gain of 27.5 I.Q. points
for 13 children transferred at the mean age of 19 months from a
very impoverished to a stimulating environment which was in
effect that of a preschool.

F. Skeels, Updegraff, Wellman, and Williams (q.v.) report that
a group of children placed in a preschool described as good but
not equal to the best gained 3.7 I.Q. points in from 200 to 399
days, and 4.6 points in an attendance period of 400 days and over,
while controls matched with them who did not attend lost an
average of 1.2 and 4.6 points in the same periods.

G. Woolley (q.v.), in what is the pioneer study of the problem,
published in 1925, studied an experimental group made up of
pupils enrolled in the Merrill-Palmer School, and compared both
to what she refers to as the “Terman” group. By the “Terman”
group she means the group of children used by Terman for his
report on the constancy of the I.Q. which is tabulated on page 141
of his The Intelligence of School Children. A portion of this table
appears in this book as Table 48. Woolley’s findings are shown
in Table 43. As will be seen, she showed a much higher percentage
of children who gained in I.Q. and much higher mean gains among
those attending the Merrill-Palmer School than among the
comparable waiting-list children, or among those included in the
“Terman” group.
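
The differential reported under F above (Skeels, Updegraff, Wellman, and Williams) can be made explicit by subtracting the control group's mean change from the preschool group's mean change for each attendance period. The following is only an illustrative sketch: the figures are those quoted in the text, while the dictionary layout and the function name `net_advantage` are assumptions introduced here, not part of the original study.

```python
# Mean I.Q. changes as quoted in item F; the controls "lost" points,
# so their changes are entered as negative numbers.
# The data layout and names are illustrative assumptions.
REPORTED = {
    "200-399 days": {"preschool": 3.7, "control": -1.2},
    "400+ days": {"preschool": 4.6, "control": -4.6},
}


def net_advantage(preschool_change, control_change):
    """Return the preschool group's mean change minus the controls' mean change."""
    return round(preschool_change - control_change, 1)


for period, changes in REPORTED.items():
    print(period, net_advantage(changes["preschool"], changes["control"]))
```

On these figures the net differences come to 4.9 and 9.2 points, which shows that the contrast between the groups widens with longer attendance more than the preschool gains alone would suggest.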


A pertinent question that has been raised is whether the gains
in I.Q. that have been reported during preschool attendance are
permanent. Here the most striking study is that by Wellman
(1937). She followed through two groups from preschool to
college. The first of these two groups was given the Stanford-Binet
tests at a mean age of 66 months, and took the College Entrance
Qualifying Examination at the University of Iowa at a mean age of
[illegible]. [illegible] had attended preschool. The second group
of 57 members in all took the Qualifying Examination at the same
mean age as the first, and had been given the Stanford-Binet at
the mean age of 72 months. [illegible] had attended preschool. Of
this second group, 21 members were matched with 21 of the first
group who had attended preschool, matching being on their
original Stanford-Binet [illegible]. It was found that the 21 who
had attended preschool were consistently ahead in all subsequent
tests of those who had not attended. So, too, for another group of
41 who had attended preschool, matched with another group of 41
who had not attended. When both groups were given the American
Council on Education Psychological Examination for High School
Students during their high school career, those who had attended
preschool excelled those who had not.


TABLE 43

I.Q. CHANGES IN THREE GROUPS, CHANGE OF ±5 CONSIDERED CONSTANT

(Adapted from Woolley, Table 4, p. 478)

           MERRILL-PALMER         MERRILL-PALMER SCHOOL   TERMAN GROUP
           SCHOOL PUPILS          WAITING LIST
           N    Per    Av.        N    Per    Av.         N    Per    Av.
                cent   change          cent   change           cent   change
Increase   27   63     19.7       12   33     12.7        25   25     [illegible]
Decrease    8   18.5   10.8       13   36     16.2        27   27     [illegible]
Constant    8   18.5              11   31                 47   47
Totals     43                     36                      99
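
The percentages in Table 43 can be checked against the counts in each group. The sketch below is offered only as an aid to reading the table: the counts are those given there (the waiting-list "constant" count of 11 is inferred from the group total of 36), and the labels and the function name `percentages` are assumptions made for illustration.

```python
# Counts of children whose I.Q. increased, decreased, or stayed constant
# (within 5 points), read from Table 43. Group labels are illustrative;
# the waiting-list "constant" count of 11 is inferred from the total of 36.
COUNTS = {
    "Merrill-Palmer pupils": {"increase": 27, "decrease": 8, "constant": 8},
    "Waiting list": {"increase": 12, "decrease": 13, "constant": 11},
    "Terman group": {"increase": 25, "decrease": 27, "constant": 47},
}


def percentages(group):
    """Convert one group's counts into percentages of that group's total."""
    total = sum(group.values())
    return {label: round(100 * n / total, 1) for label, n in group.items()}


for name, group in COUNTS.items():
    print(name, sum(group.values()), percentages(group))
```

The recomputed figures agree with the tabulated ones to within rounding; for example, 27 of 43 pupils is 62.8 per cent, which the table gives as 63.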

While this is decidedly the most impressive study on the
problem, it does not stand alone. By way of confirmation, Olson and
Hughes (q.v.) found consistently higher later M.A.’s for children
who had attended preschool than for children paired with them
on a first early testing, but who had not attended. On the other
hand, [illegible] (1928) found that while children entering the
first grade after a considerable time in preschool tested on an
average 6 points above children otherwise comparable who had not
attended, yet after 18 months in the elementary school the average
I.Q. of the first group had dropped, the average I.Q. of the second
group had risen, and the original difference of 6 points was
reduced to 2 points in favor of the first group. T. J. Peterson
(q.v.), too, investigating a group entering the University of Iowa
Elementary School, some with a preschool background and some not,
found the former 3.6 I.Q. points ahead at entry, but only 2.6
points ahead at the end of the year. Finally Voas (q.v.) reports
that a group of 111 nursery school “graduates” in the schools of
Winnetka, Illinois, on several subsequent Binet testings at several
age levels, had almost the same mean I.Q. as the total 896 pupils
tested, and were in this respect indistinguishable from them.

Wellman (1945), summarizing the literature to date, finds that
of 22 preschool groups 11 had Stanford-Binet gains of 6 points or
more, the total number of cases being 1537. Of 14 non-preschool
groups, only 2 had similar gains. She repeats once again her
contention that the results at the University of Iowa are not unique.

So much for the data. Now for an attempt to appraise them.*
A. First, as to the general trend revealed, three statements are
in order. (a) As expressed in averages, the trend is unmistakable.
In almost every one of the studies some mean gain during
preschool attendance is reported, however it may be explained or
qualified. (b) The mean gains reported are sometimes small, and
in these cases the authors often consider them negligible and
meaningless. However, the obtained averages do not tell the whole
story. Thus Grace Bird, who [illegible] an average gain of only 1.8
points (see Table 42) and is inclined to dismiss it, also reports
individual gains ranging from 0 to 25 points. [illegible] (1932),
whose reported mean gains are considerable, finds individual gains
in I.Q. running as high as 40 points. It may quite possibly be that
the preschool environment has a greater effect on some children
than on others. (c) Consideration should be given to the
populations used in the various studies. These will be found in
Table 42 and in the summary presented above. It will be seen that
the Iowa group reported on by Wellman, for which substantial and
[illegible] mean gains were found, is much larger than all the
[illegible].

* This appraisal is based to a considerable extent upon
[illegible]; Stoddard, 1940, 1943; Stoddard and Wellman, 1940.
The reader is referred to these sources for a fuller treatment.

B. The statistical basis of the findings, particularly those made
at Iowa, has been critically examined and reviewed by McNemar
(1940, 1945); and reworked with new techniques by Wellman and
Pegram (q.v.). The general upshot seems to be that intelligence
quotient changes are at any rate associated with preschool
experience, whether caused by it or not. Also a number of writers
have argued that the reported gains cannot be considered well
established because of errors of measurement and imperfections
in the tests used. Two points are here involved.
(a) A typical finding is that during preschool attendance low
I.Q.’s tend to rise, medium I.Q.’s are less affected, and very high
ones tend to fall. This situation is exemplified in the data of
Wellman (1932) presented in Table 44. It will be noted that
whereas the 10 children classed below average make a mean gain
of 36 points in 2 years, the 7 who are in the “genius” category
lose 2.1 points in the same period. Goodenough (1940) has called
attention to this as a serious issue. She suggests that it
[illegible] phenomenon of regression towards the mean, i.e., of the
tendency of all extreme deviations to rectify themselves in the
direction of the average over a period of time, and argues that
this would [illegible] gains illusory and transient. It is somewhat
[illegible] this [illegible], for the numbers in the “genius”
category are quite small, and their average loss of I.Q. is
certainly very slight.


TABLE 44

PERCENTILE GAINS OF CHILDREN CLASSIFIED BY I.Q. OVER ONE AND
TWO YEARS OF PRESCHOOL ATTENDANCE

(Wellman, 1932, Table 3, p. 53)

                  Number of   Gain one            Number of   Gain two
Classification    children    year       S.D.     children    years         S.D.
Below average     19          22.1       19       10          36            20.5
Average           104         23.6       21       35          35            23.1
Superior          65          15.1       20.6     26          17.4          14.1
Very superior     61          6.7        10.2     24          [illegible]   4.9
Genius            18          -3.9       10.2     7           -2.1          [illegible]


(b) The second, more important objection, is that mental
[illegible] at early ages are of doubtful reliability and validity
and do not predict later test performance or mental status well.
This has already been noticed in our previous discussion of mental
[illegible] young children. It has grave implications for the
present [illegible], since if earlier and later testings are
unrelated, one cannot say whether the alleged gains mean anything
or even if they are real.
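
The regression argument attributed to Goodenough can be illustrated by a small simulation. The sketch below assumes a deliberately simple model that is not taken from any of the studies cited: each child has a fixed "true" score, each testing adds independent random error, and nothing at all happens between the two testings. The cut-offs of 85 and 115 and all parameter values are likewise assumptions chosen only for illustration.

```python
import random


def simulate(n_children=5000, true_sd=15.0, error_sd=10.0, seed=0):
    """Simulate two testings with no real change in between.

    Returns the mean retest change for children whose FIRST score was
    below 85 and for those whose first score was above 115.
    """
    rng = random.Random(seed)
    low_changes, high_changes = [], []
    for _ in range(n_children):
        true_score = rng.gauss(100.0, true_sd)
        first = true_score + rng.gauss(0.0, error_sd)   # first testing
        second = true_score + rng.gauss(0.0, error_sd)  # second testing
        if first < 85:
            low_changes.append(second - first)
        elif first > 115:
            high_changes.append(second - first)
    low_mean = sum(low_changes) / len(low_changes)
    high_mean = sum(high_changes) / len(high_changes)
    return low_mean, high_mean


low_mean, high_mean = simulate()
print("mean change, initially low scorers: ", round(low_mean, 1))
print("mean change, initially high scorers:", round(high_mean, 1))
```

Under these assumptions the initially low scorers "gain" and the initially high scorers "lose" on retest, although no child's true score has changed. This reproduces the direction of the pattern in Table 44 without any environmental effect, which is the force of the objection; it does not by itself show that the reported gains are of this kind.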


Wellman (1940) has made a detailed reply as follows. (a) Testing
of young children reveals a wide range of I.Q. changes associated
with preschool attendance even among selected cultural groups.
(b) Such gains are found consistently for all ages as between fall
and spring testings, whereas unreliability would [illegible].
(c) There are only small sex differences, [illegible] of
presumptive evidence for reliability. (d) Test-retest correlations
at preschool ages are within the range of test-retest correlations
for a wide range of school ages. Nemzek (1933 b), reporting on the
whole research literature, [illegible] test-retest correlation for
the Stanford-Binet [illegible]. Wellman’s own test-retest
correlations [illegible] represented as a good [illegible] early
testing has been exaggerated [illegible] universal than it is. As
will be recalled, our survey of the data on early testing indicated
[illegible] at very early ages, by about the third year
[illegible] appreciable.
Taken together with the facts, these arguments have much
force. Test scores in early childhood may be doubtful indicators
of later status, although the claim needs considerable
qualification. On the other hand, such consistent and widespread
gains as have been discovered are impressive and must have some
explanation. It is on this datum that Wellman ultimately and very
properly rests her case, and it cannot be disrupted by any a priori
arguments or indeed by anything short of a direct demonstration
that the alleged gains themselves do not take place. All any less
decisive argument can properly do is to leave a certain reasonable
doubt as to basic explanations.

C. Another possibility that has been suggested is that the gains 
in I.Q. may arise from a practice effect produced by the first test- 
ing, which makes later test performance better, or perhaps from 
a general adjustment to school conditions which makes the child 
more “test wise.” The most direct evidence here is that of Mc- 
Hugh, who reported gains in slightly less than 2 months which 
were as great as those obtained by Wellman in a full school year. 
Also, it has been found that gains between testings in the fall 
and in the spring are usually more marked than those between 
testing in the spring and in the fall, with the summer vacation 
intervening. But considering the consistency with which at least 
some increase in intelligence scores is associated with preschool 
attendance, and the fact that they are often found cumulative as 
the time of attendance increases from one to two to three years, 
and also that they seem associated with attendance at one type 
of institution much more regularly than with attendance at others, 
it is difficult to believe that practice effects and general adjust- 
ment account for all the reported results, although very likely 
they do have some influence. 

D. Several of the studies, for instance that of Goodenough
(1940), while reporting gains in intelligence scores for preschool
children, also report about equally great gains for control groups
who do not attend preschool. However, as Stoddard and Wellman
point out in reply, one need not assume that a preschool is a
unique environment and the only one capable of affecting
intelligence test performance. Surely a good home might have a
similar effect; and considering the evidence presented on the
effect of foster home environment, this seems quite reasonable.
Thus, when a control group gains comparably to a preschool group,
this does not prove that preschool attendance has no effect, but
only that some other institutional setting may have a similar one.
This may perhaps be the reason why the preschool and kindergarten
“graduates” in the Winnetka schools were not found distinguishable
in intelligence test performance from the usual run of pupils, for
Winnetka itself is a privileged environment. Also, it accounts for
the finding of Olson and Hughes that when socioeconomic conditions
are equated for preschool and non-preschool children, the
differential disappears. Also, it accounts for the fact that in
preschool it is the average or below average child who makes the
greatest gains, whereas the superior or very superior child may
make little or no gain, [illegible] it is mostly the superior
children who enjoy the better and more stimulating home
conditions. Surely, it is no argument against a good hospital to
say that some people have such good homes they do not need its
services; many other people emphatically do need them!

E. In general summary, the evidence cannot be considered
conclusive, and broad statements of cause and effect should
certainly be avoided. Such statements are not only [illegible],
but also, and more important, [illegible] attention from the real
issue. What [illegible] is that certain types of school environment
are associated with superior mental test performance, and more
specifically with improved mental test performance. Just why such
gains take place and whether they are as lasting as Wellman
contends are questions to be treated with much reservation. The
question obviously for anyone who wishes to deal constructively
with [illegible] nature is what the characteristics of such
favorable environmental settings are. It is not impossible to
arrive at an answer that is at least suggestive.

Skeels, Updegraff, Wellman, and Williams (q.v.) found that
children benefited greatly by attendance at what they describe as
a reasonably good preschool that was organized in the orphanage
home where they had been placed. What, then, were the
characteristics of this favorable environment? The preschool took
most of the day. It began at 8 A.M. with vigorous play. Then there
were [illegible] activities. Tomato juice and codliver oil were
served at 9:30. Then there was a rest period, followed by
constructive activities and a story or excursion, after
[illegible]. After [illegible] there were naps. School continued
[illegible] about 3 P.M. with [illegible] and constructive
activities generally, continuing until about five o’clock. After
that, supper and bed.
Skeels and Dye, again, reported the sensational and perhaps
[illegible] gain of an average of 27.5 I.Q. points for 13 children
in an orphanage, the gain being associated with a special
environment organized within the institution. The essential
feature of this environment appears to have been special care for
these children, organized by using older girls who were inmates
of the institution and who were mostly feeble-minded, but who
could nevertheless stimulate and help their small charges,
spending time with them, and lavishing affection upon them. Toys,
games, excursions, and activities were also organized. A
considerable amount of not entirely worthy ridicule has been heaped
on this report, and particularly upon the effectiveness of using
feeble-minded older girls (see “Personal Opinions of the Yearbook
Committee”). Gesell and Amatruda (q.v.), in a most impressive
passage, have described the mentally depressing and lowering effect
of a uniform large-scale institutional setting upon the mentality
of young children, even when that setting is as good as it can be
made. One of the crucial characteristics of a favorable
environment, in school and elsewhere, may very well turn on
intimacy, personal interest, personal stimulation, and affection.
Quite possibly geniuses may supply these things better than morons,
but to pour scorn on the efforts of the latter, particularly when
those efforts seem to have been successful, is decidedly out of
place.
A report of much interest in connection with interpreting the
data before us is that of Lamson (1938). She studied 141 children
in an elementary school with a “vital” curriculum, i.e., one that
considered individual interests, required activity and
self-direction, and allowed progress at optimum individual rates.
She found no gains in I.Q. between the first and fourth grades. One
must not think in partisan terms of this report as “disproving”
other studies, for that is not the issue. What it does suggest is
careful analysis to find out which environmental characteristics
are effective and which are not.


As Stoddard, speaking of the children for whom marked gains
were reported, admirably puts it (1943, pp. 391-92): “It is neither
superficial glibness nor a familiarity with test fragments that is
being built into the tissues and behavior patterns of these
children; it is a sound and persistent connection between the child
and objects, places, persons, problems and relationships. The
preschool child, for example, is treated as a person capable of
thought, who demands exercise and achievement in the abstract
as a part of the spinning out of his natural endowment. Placed
in a dreary cocoon of life, without much guidance at home, a life
that for millions of children from two to five years of age is
characterized by negativism and restriction, the child fails to
grow satisfactorily; from the standpoint of mental and social
experience, he endures a season of under-nourishment.”
3. The effect of later schooling

There is substantial evidence that continuation in school is
associated with improvement in mentality as revealed by test
scores. Thus McConnell (q.v.) made a study of 70 college
[illegible] who had been tested as freshmen on the 1927 edition of
the American Council on Education Psychological Examination
for College Freshmen. As seniors they were tested on the 1928
edition, and the 1927 scores were transmuted into 1928
values. A mean gain of 40 points was found in the composite
scores. Again, Thomson (q.v.) studied 106 students who had taken
the 1935 edition of the American Council Examination as high
school seniors in January 1937. In September of the same year
they took the 1937 form of the test. The scores on the 1937 edition
were transmuted into the [illegible] edition. A mean gain
of 14.5 points was found. It should be noted that the elapsed time
between testings was much less in this study than in that
of McConnell. So, too, Livesay (q.v.) found a mean gain of 44.8
points on the American Council Test for 50 students in college
for 4 years. And Rogers (q.v.) has reported [illegible] Bryn Mawr.
Also Barnes (q.v.) gave the [illegible] to [illegible] freshmen on
admission, and to the same [illegible] sophomore year. He found a
net mean gain of [illegible].

The most important study on the issue is that of Lorge (1945).
Its chief findings are summarized in Table 45. Lorge retested in
1941 a group of 131 persons all of whom had been members of a
group of 863 tested in 1921, twenty years earlier. In the 1941
testing he used the Otis Self-Administering Test of Mental
Ability, Higher [illegible] Form B, and the Thorndike [illegible]
High School Graduates, Form V. In the 1921 testing the
Thorndike-McCall Reading Scale and the I.E.R. Arithmetic Test had
been used, and composite scores worked out. The scores so derived
from these two tests measure about what is measured by a standard
intelligence test. As will be seen, [illegible] was a substantial
relationship between [illegible] testing after a lapse of twenty
years, and the amount of schooling which the subject had received
during the period. Amount of schooling is interpreted as grade
completed. To show how the table should be read, consider those
in the step interval 89-98. This classification indicates the
composite score they made in 1921. Reading down the columns, we
find that the average Otis scores in 1941 for those who had
completed the [illegible] grades are respectively 39, 38, and 37,
but that the average Otis scores made in 1941 for those who had
completed [illegible] and 17 school grades are 53.5 and 54.5. This
amounts to an advantage in their favor of 2 years of mental age on
the Stanford-Binet scale. An examination of the tabulated results
will show [illegible] evidence pointing towards the association of
high intelligence scores with continuation in school. Similar
results were obtained for scores on the Thorndike test also.
[illegible] (1946) raises various critical points on the Lorge
study. This writer contends that the method of tabulating
exaggerates the gains, which, although real, are “modest,” and no
cause for “smugness,” and objects to the translation of group test
scores into age scores and I.Q.’s for adult subjects.


4. General evaluation of results

[illegible] that contributes to an understanding of the
relationship between test performance and schooling is of great
importance for psychometrics, because of the close connection
between [illegible] and validation of tests and the institutional
environment of the school, which has already been noted. This, by
all means, is the point on which one should concentrate for a
proper appreciation of the work that has been discussed. Sweeping
causal “explanations” should be treated with the greatest
reserve. The evidence does not decisively support them, and they
deflect attention from the really crucial issues.

It seems clear that continuation in school and exposure to
[illegible] types of school environment, particularly in early
life, are associated with superior mental test performance, and
above all with the improvement of mental test performance. This is
very far from a disparagement [illegible]. On the contrary, it
indicates that their psycho[illegible] and implications are
[illegible] associated [illegible] and stimulating [illegible]
rather than detracts from its [illegible]. The work discussed
contains many intimations and implications of the highest
importance for educational organization. What kind of school
environment should be set up to foster mentality? [illegible]
conventional school environment fail to do so? [illegible] the
pertinent, broad, and vital questions raised. And it would seem
that the proper use and interpretation of mental tests can at least
help to provide the [illegible]. Certain kinds of environment are
effective, others are not. Which kinds? Here is contained a
multitude of challenges for further study and for practical
organization. But we can lose sight of [illegible] if we persist in
trying to focus everything on the issue of environment “versus”
heredity.

Also, broader problems of educational administration are
[illegible] that while the school selects [illegible] imperfectly.
There is an enormous [illegible] mental quality and power, due to
the [illegible] is determined so largely by economic [illegible]
rather than intellectual ability. And if schooling not only
selects [illegible] it, the seriousness of such waste becomes far
greater.

MENTALITY AND RACE


Thirty years of experience in applying mental tests to the study
of racial psychology has thrown [illegible]


1. Racial superiority and inferiority

It has long and consistently appeared that Negroes and American
Indians are considerably, and approximately to the same degree,
inferior to whites in mental test performance. The [illegible]
white draft on Army Alpha was exceeded by 20% of pure
Negroes, by 25% of Negroes with one-quarter white blood, by
[illegible] of mulattoes, and by 35% of quadroons. Also, in testing
on a considerable scale in the state of Virginia, it was reported
that the mean test performance of pure Negroes was 69.2% that of
whites, the test performance of Negroes with one-quarter white
blood was 73.2%, of mulattoes 81.2%, and of quadroons 97.8%.
This has often been considered clear-cut evidence for hereditary
racial differences.

The great systematic difficulty of all such work, however, is
always to determine the degree of racial purity or admixture.
Ferguson’s criterion was skin color. He determined the racial
composition of his subjects by matching their skin pigmentation
against color combinations containing known proportions of black,
white, yellow, and red. It was assumed that the greater the
proportion of black, the purer the Negro strain. This has proved
entirely fallacious, for many Negroes with very dark skins are
very far from being pure racial samples. The only way to ascertain
racial admixture accurately, or even with a reasonable reliability,
would be by a study of family heredity. Quite apart from the
labor involved, the data for this, consisting of records of
marriages and births at the very least, simply do not exist,
especially in the case of Negroes and Indians. So Ferguson’s
proposed hierarchy collapses.

Nevertheless, the obtained differences in mental test
performance are undoubtedly real, and demand explanation. They pose
problems of high significance, both for the practical issues of
race relationships and for the proper understanding of mental tests
and psychometric techniques.


2. Nonracial factors

[illegible] stressed [illegible] elaborated by numerous other
writers.

A. The first and most obvious of these nonracial factors is that
of language. Attention has often been called to it. Pintner and
Keller (q.v.) used the Stanford-Binet scale, the Pintner
Non-Language Group Test, and various performance tests, in testing
varied racial groups in Youngstown, Pennsylvania. [illegible] that
children who hear a foreign language [illegible] test lower on the
Stanford-Binet and probably on any linguistic test than those who
use English at home. This difficulty may not be serious in the case
of Negroes, but it certainly is in that of Indians. For this reason
tests like the Pintner Non-Language Group Test and the various
performance tests are often recommended for the study of racial
differences. When they are used, there will probably be less lag
with reference to [illegible] white performance. But one difficulty
is that such instruments probably do not measure the same abilities
as verbal tests, and may have considerably less significance.
Another difficulty will appear in connection with the next point to
be considered.

B. Another type of nonracial factors are those designated as
cultural. [illegible] language, but go far beyond it [illegible]
expressions, similes, proverbs, pictures [illegible] have different
[illegible] meanings for different groups. Thus Klineberg
[illegible] Army Beta to a group [illegible] items calling for
picture [illegible] be suitable. A great [illegible] net missing
from the [illegible] of the bowler, the [illegible] with the word
“crowd” in a multiple [illegible] no distinction between
intentional and unintentional injuries.


: y i 
C. Socioeconomic factors greatly complicate all racial
comparisons. [illegible] that they are associated with mental test
[illegible]. But socioeconomic categories mean different things
for different groups. Thus among Southern Negroes, only the most
exceptional can become lawyers or physicians or large business
[illegible], and then they are by no means equated to whites in the
same classifications. So a semiskilled Negro worker may represent a
decidedly different socioeconomic level from that of a semiskilled
white worker. Such factors may easily have more [illegible] on test
performance than racial inheritance itself. Beckham [illegible]
studied a population of 1,100 Negro boys and girls. [illegible] for
the upper socioeconomic levels ranged from 97 to 101, and for the
laboring groups in the low 90’s. They were lower than those of
whites in the upper brackets, but below the top classification
[illegible] socioeconomic status there was not much difference
[illegible] groups. The same difficulties undoubtedly apply to most
racial [illegible].


D. The impact of schooling, which as we have seen is associated
with [illegible] performance, is not identical for all racial
groups. [illegible] Indians have fewer educational opportunities of
any kind than whites. The selective effect of schooling upon them
is both [illegible] and different. The content and type of the
[illegible] environment available to them is inferior. Thus groups
[illegible] various races classified on educational status are not
fully commensurable. Also, the suggestion has been made that
reactions to school are different, and [illegible] racial groups
may take it more seriously and work harder (Ferguson, 1916).
E. The problem of rapport, always [illegible] in mental
measurement, becomes crucial in much testing of different racial
groups. The sophisticated Negro is apt to be suspicious, for he is
aware that mental measurement has seemed to relegate him to
inferiority. The rural Negro is apt to be shy and fearful. Such
influences and effects are often the cause of confusion, and of low
test performance which is quite invalid because of the intrusion of
destructive variable errors. There is, also, a wider problem of
motivation, i.e., of willingness to be tested.


F. The claim has been made from time to time that a speed
factor enters into racial comparisons based on mental tests.
Negroes, Indians, and members of other races are said to be
slower in their reactions than whites. It is not clear that this
[illegible] is generally present, although in many special cases
and situations it should undoubtedly be taken into consideration.
Quite probably speed of response is not a true hereditary racial
factor. Klineberg (1928) compared groups of Negroes and Indians in
this regard. Some of them lived on reservations or in rural
settings. Others were city dwellers or were students in good
colleges. He found that with the latter, speed of reaction to test
situations was much higher than with the former.


3. Conclusions 


From the various considerations that have been discussed,
certain broad conclusions have emerged that can be regarded as
reasonably well established.

A. It is possible that true hereditary racial differences in
mentality and ability exist, but how and to what extent they
reflect themselves in test performance is not known. Klineberg
(1935 a) made an often quoted study of the effect on the mentality
of Southern Negroes of migration to the Harlem district of New
York City, which is a very intensive urban environment. As will
be seen from Table 46, he found a steady rise in I.Q. parallel to
length of residence in Harlem. The tests he used were the
Stanford-Binet, the National Intelligence Tests, the Otis
Self-Administering Tests of Mental Ability, Intermediate
Examination, the Pintner-Paterson Short Scale, the Minnesota Paper
Form Board, and the Curtis Test of Arithmetical Achievement. These
tests were given to a total population of 3,081 Negroes resident in
Harlem, some of whom had migrated from the South. In all the
verbal tests except that in arithmetic, there was a steady rise in
scores proportional to the length of stay in New York. Klineberg
tried to show that migration to New York was not selective, i.e.,
that it was not confined chiefly to the abler Southern Negro
residents. He drew this conclusion from school records in Nashville
and Birmingham, showing that those who moved north were not a
superior group. But educational records in Southern Negro schools
are not very trustworthy, so a doubt arises. Table 46, showing
relationship between average I.Q. and length of residence in New
York for 10-year-old Negro girls, is a fairly typical citation
from Klineberg, though he presents a considerable mass of other
data. It will be noted that the actual change in test performance
parallel to length of New York residence is not very great.
Klineberg’s demonstration is not so assured as is sometimes
supposed. Still Garth (1937), summing up many studies including
this one, reaches the conclusion that differences in test
performance as between different racial groups can in the main be
explained [illegible] and opportunity. This, of course, is the
[illegible], and without entering into a discussion of causes,
it may be said that such differential scores certainly reflect
nonracial factors and the circumstances of life and do not separate
out or isolate hereditary racial factors with any clearness.


TABLE 46

STANFORD-BINET I.Q.’S OF NEGROES CLASSIFIED BY LENGTH OF
RESIDENCE IN NEW YORK CITY

(Klineberg, 1935 a, p. 46)

Group classification      N       Average I.Q.
[illegible]               42      81
[illegible]               40      84
[illegible]               40      85
[illegible]               46      89
[illegible]               47      87
[illegible]               99      87
[illegible]               215     85

[illegible] interpretive summary, propose that instead of
attempting to study the psychology [illegible] the psychology of
specific census [illegible] such as the Cherokees [illegible] the
Five Civilized Tribes of Oklahoma, the children of Welsh
[illegible] in Wilkes-Barre, Pennsylvania, and so on. Such census
groups are defined in terms partly of ecology, partly of
anthropology, partly of political science. This is in line with the
resolution passed in 1939 by the American Anthropological
Association at its New York meeting (q.v.), which was as follows:
“(1) Race involves the inheritance of similar physical variations
by large groups of mankind, but its psychological and cultural
connotations, if any, have not been ascertained by science.
(2) Anthropology provides no scientific basis for discrimination
against any people on the ground of racial inferiority, religious
affiliation, or linguistic heritage.” This implies that
psychometric instruments, if the scores they yield are to have any
meaning at all, must be brought to bear upon and interpreted in
terms of actual functioning groups of human beings.


When so used they can be of great and authentic service in
many ways in helping to deal with the complex and troubled
issues of race. By way of a single example, M. D. Jenkins (q.v.)
undertook a psychometric study of 8,400 Negro children in the
Chicago schools. The ablest of this population were selected by
nomination by their teachers, and of these again the ablest,
numbering in all 103, were measured by the Stanford-Binet scale. The
results are shown in Table 47. It will be seen that all the intelli- 


TABLE 47 


DISTRIBUTION OF I.Q.’S OF 103 SUPERIOR NEGRO CHILDREN


I.Q. Frequency


out, to discover and foster ability and to uncover the facts
necessary both for research and for practical decisions. And as he
also [...] intelligence tests should be supplemented by the use of
the best obtainable measures of “stability and special aptitude
[...]


GENERAL CONCLUSION 


It is clear that the four avenues of applied mental measurement
explored in this chapter have brought to light findings not only
of interest in themselves but of the first importance to
psychometrics itself. The basic logic embodied in psychometric
instruments is to set up a working conception of some ability or
process, to choose test items with respect to it, and to interpret
responses to these items in terms of the performance of a
standardization group, taken as representative of the distribution
of the ability or process in a much larger population or universe
of discourse. The obvious difficulty is the representative character
of the standardization sample. If it is drawn from one census group,
or one racial group, or one socioeconomic level, or from persons
with a certain stated amount of schooling, the norms it yields may
lead to falsification when they are applied elsewhere. Even when,
as with the revised Stanford-Binet, the standardization sample is
selected in proportion to occupational distribution in the country
as a whole, it may easily be biased in other respects. In any case,
such a group is always an average sampling of many subgroups, and
thus there is a danger of falsification when the norms are applied
to any one of them. Such distortions and misunderstandings of norms
need not occur. But if they are not to do so, we must constantly
bear in mind the real basis of comparison and evaluation on which
the test is built.

An additional major point which emerges from this chapter is
that the results of tests so constructed invariably reflect the total
circumstances of life that surround and affect the persons tested,
and cannot be isolated from them. This enhances rather than
detracts from the psychological content, meaning, and usefulness
of tests. But it must never be forgotten. If test scores are taken
as revealing absolute, isolated mental components, the same for
all human beings irrespective of circumstance, the most deplorable
misinterpretations are certain. If we remember that a mental test
is an instrument for evaluating an unknown individual, the
subject, with reference to the performance of a standardization group
taken as a sample, and to make the evaluation conveniently as a
numerical score, and if also we remember that the subject’s
performance will reflect in some measure all the actual circumstances
of his life, then psychometric instruments can be used for
constructive purposes. They can reveal much that would otherwise
go unrecognized and unknown, and the numerical scores they
afford can be guide lines in the endeavor to understand and
better human conditions.
Nor can it be said with justification that such tests are merely
ad hoc practical instruments, without theoretical validity or
significance. The performance of a standardization sample on
well-chosen items focusing on a well-selected concept really does
represent the functioning of the mental process or ability so
conceived in a particular setting. If it were possible to get at the
basic components of human mentality in the abstract and
irrespective of circumstance, perhaps far better tests could be made.
But the science of psychology has not progressed so far, and
perhaps it never will. Perhaps its true task will always be the study
of human mentality as it actually manifests itself in innumerable
settings; and if this is so, the universal test unaffected by
socioeconomic, or family, or educational factors, and dealing with
something which is constant for all groups, is not only an
unattainable but also a false ideal.


QUESTIONS FOR DISCUSSION


1. Consider what inferences can be drawn from the fact that verbal
intelligence tests “favor” socioeconomically privileged groups and
urban [...]

2. What characteristics of a family and home environment would
you expect to be associated chiefly with superior test performance?
What characteristics would probably not be associated at all with it?
What characteristics might be associated with poor test performance?

3. To what extent do you consider the limited conclusions drawn
by Carter in the item above in the additional readings regarding the
effect of different environments on twin resemblance compatible with
the findings of Newman, Freeman, and Holzinger?

4. It is usually believed that an ordinary school environment has
little effect in producing better mental test performance. If this is
true, what reasons can you find to explain it?

5. Consider the bearing of the material in this chapter upon the
problem of test validation, i.e., the determination of what a test
really measures.

6. If test performance is responsive to the various types of
influence discussed in this chapter, might this be used as an argument to
show that tests are worthless? How might such an argument
be answered?

7. Examine in detail the various criticisms made by Goodenough
in the item listed above among the additional readings. How far do
they seem to you correct in view of the data and contentions she
refers to? What answers, if any, can you find to them?

8. Examine the defense put forward by Stoddard in the item listed
above among the additional readings. To what extent does he seem
to summarize and interpret correctly the studies to which he refers?
Can you find any replies to his contentions?

9. Consider the idea that mental tests reveal absolute, abstract
isolated mental functions and abilities as such. What reasons are
there for and against such a view? What would be some of the
theoretical and practical consequences of maintaining it?

10. If certain racial groups and certain socioeconomic groups make
better responses relatively to performance tests than to verbal tests,
does this mean that performance tests are superior to verbal tests?
If not, what does it mean?



CHAPTER X 


WIDER PSYCHOLOGICAL ISSUES IN MENTAL 
TESTING 


INTRODUCTION 


Having sought to discover what light is thrown upon the
psychological content and meaning of mental tests by the work that
has been done in their major areas of application, we now turn
to certain wider and more general psychological issues involved
in them. These are the problem of the constancy of mental traits,
the problem of the nature and course of mental growth, the
problem of the distribution of mental traits, and the problem of
hereditary and environmental influences in determining mentality. All
these four problems, together with the topics discussed in the last
chapter, come to a focus in one culminating and inclusive
problem; namely, that of the psychological significance of test
performance, which is sometimes formulated, though in an unduly
limited way, as the significance of deviates.


THE CONSTANCY OF MENTAL TRAITS


1. The problem


Granted that a person displays a certain pattern of mental
abilities, is he likely to retain that same pattern over different
and perhaps extended periods of time and under different
circumstances? Is a person’s mentality likely to stay the same, or is it
likely to change? This is the problem of the constancy of mental
traits or abilities.

It is one of the most generally misunderstood of all topics
connected with mental measurement. First, it is continually
confused by being identified with the question of the constancy of
the intelligence quotient. But this is putting the cart before the
horse. The intelligence quotient is nothing more than one of the
indices or units often used for expressing one of the aspects of
mentality in numerical terms. Granted that it is a properly chosen
and derived unit, then if a person’s mentality remains the same
over a period of time, his I.Q. also should stay the same. But if
his mentality changes, then so should his I.Q. So the issue is with
the individual’s pattern of mental abilities and its tendency to 
remain the same or to change; and the enormous mass of data 
which has been accumulated on the constancy of the I.Q. is sig- 
nificant chiefly because it constitutes evidence bearing upon that 
issue. 

Again, the problem of constancy is often thought to have a 
much closer relationship to the question of the hereditary deter- 
mination of mentality than it actually has. If the I.Q. can be 
shown to be relatively constant, this is taken as proof presumptive 
of hereditary determination. If it turns out to be variable, this is 
supposed directly to prove environmental influence. But the mat- 
ter is not so simple. It would be possible to derive a person’s 
“height quotient” by dividing his height in millimeters by his age 
in weeks or months. From time to time that quotient would 
change, because there are periods when increase in height slows 
down or speeds up while chronological age goes steadily on. These 
changes would be due to constitutional or hereditary causes. 
Again, we might derive his “weight quotient” in the same way. 
But if this suddenly and markedly increased, we might suspect 
that he was over-eating. In the same way, a change in the in- 
telligence quotient indicates only that some cause has altered 
the previously determined relationship between the individual’s 
chronological age and his mental development. All we know is 
that a problem exists, for the cause of the disturbance may be 
hereditary and constitutional, or it may be environmental, or it 
may be both. Also, when the I.Q. remains constant, this again 
simply defines a problem. The constancy may be due to a heredi- 
tary pattern of mental abilities that resists all outside influences, 
or it may be due to prepotent uniformities in the environment, or 
it may be due in part to both. The question of constancy is one 
of fact. On the fact, so far as it can be ascertained, it is allowable 
to build any opinions we desire, but the fact does not in itself 
impose opinions. 
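
The arithmetic of the “height quotient” can be made concrete with a
short sketch. The Python below is an illustration only: the function
name `quotient` and the ages and heights are hypothetical values
invented for the example, not data from any study. It shows that a
quotient of this kind shifts whenever the measured trait and
chronological age do not advance in step, and that the figure by
itself says nothing about the cause.

```python
def quotient(measure: float, age_months: float) -> float:
    """Divide a measurement by chronological age (hypothetical helper)."""
    if age_months <= 0:
        raise ValueError("age_months must be positive")
    return measure / age_months


# Hypothetical (age in months, height in millimeters) pairs; not real data.
observations = [(96, 1300.0), (120, 1400.0), (144, 1500.0), (168, 1700.0)]

quotients = [quotient(height, age) for age, height in observations]

# Change in the quotient from one measurement to the next.
changes = [later - earlier for earlier, later in zip(quotients, quotients[1:])]

for (age, _), q in zip(observations, quotients):
    print(f"age {age:3d} months: height quotient = {q:.2f}")
```

In the invented series the quotient falls at an uneven rate, because
growth speeds up toward the end while age advances steadily. The
numbers alone cannot tell us whether the cause is constitutional or
environmental, which is the point made in the text.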

Although the problem of constancy does not have the imme- 
diate implications often attributed to it, nevertheless it remains 
one of the most practically important in psychometrics. If men- 
tality remains substantially unchanged over long periods of time, 
then great reliance can be placed on one single carefully obtained
and carefully evaluated test performance. If, on the other hand,
it is subject to substantial alteration, then frequent re-surveys are
in order. If certain circumstances do not affect it and other
circumstances do, then it is important to know what these circumstances
are. But once again what we have to deal with are not general 
theories, but matters of fact on which wise practical decisions can 
be based. 


2. The constancy problem and the constancy of the I.Q. 


The bulk of available evidence, though not all of it, regarding 
the constancy of mental abilities is contained in the studies on 
the constancy of the intelligence quotient. The outcomes of these 
studies will be summarized, and their significance interpreted as 
compactly as possible.

A. The broad evidence is that under ordinary circumstances
the intelligence quotient is constant within certain clearly defined
limits. In Table 48 are shown the data which Terman assembled
many years ago on this point. He summarizes them as follows.
“(1) The central tendency of change is represented by an increase
of 1.7 in I.Q. (2) The middle fifty percent of change lies between
the limits of 3.3 decrease and 5.7 increase. (3) The probable error
of a prediction based on the first test is 4.5 points in terms of
I.Q.” (Terman and Others, p. 142). Rugg and Colloton (q.v.),
summarizing the results obtained up to 1921 on the Binet tests,
involving large numbers of subjects, find that for half the cases over
unspecified periods of time the average changes are less than 6
points increase and less than 3 points decrease. The results
reported by Gray and Marsden (q.v.), who summarized studies
using the Stanford-Binet scale, are shown in Table 49. They find
that the central tendency of change is +2.25 points, and that the
middle 50% of all changes lie between 7.7 increase and 2.25
decrease. Baldwin and Stecher (q.v.) used the Stanford-Binet scale
with 485 children, and report that most of the changes on retesting
were within 5 points and that correlations between test and
retest were from .72 to .94. The latter coefficient is much more
representative than the former of many others that have been
reported. Thus L. S. Rugg (q.v.) gave 114 pairs of Binet tests at
intervals of from 4 to 36 months to an unselected group of children
with I.Q.’s from 73 to 133 and found a correlation between
testing and retesting of .948.


TABLE 48


CHANGES IN I.Q. AMONG 315 PERSONS RETESTED OVER DIFFERENT
PERIODS UP TO SEVEN YEARS


(From Terman et al., 1917, Table 25, p. 141)


INCREASES | DECREASES 
Extent of change N Extent of change N 
Above 20 .....+-- sees 2 Below 20 ites 7 
o B 
I o 
2 3 
3 2 
2 3 
3 2 
I 7 
2 8 
5 4 
3 7 
5 7 
7 12 
9 19 
F uaou soros 20 29 
14 15 
o i 
18 25 
o 34 
18 20 
I 23 18 
No change sien:asia sss 


However, caution must be observed in interpreting these high
correlations. They do not mean necessarily that the I.Q.’s have
not changed, but only that individuals
tend strongly to maintain their relative positions.
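
The distinction between a high retest correlation and unchanged
I.Q.’s can be checked with a few lines of arithmetic. In the sketch
below the scores are hypothetical, and `pearson_r` is a helper written
for this illustration using the ordinary product-moment formula. Every
invented retest score is 8 points above the first score, so each
individual has changed, yet relative positions are untouched.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical first-test I.Q.'s; every retest score is 8 points higher.
first_test = [73, 85, 92, 100, 108, 117, 133]
retest = [score + 8 for score in first_test]

r = pearson_r(first_test, retest)
mean_change = sum(b - a for a, b in zip(first_test, retest)) / len(first_test)

print(f"test-retest r = {r:.3f}, mean change = {mean_change:+.1f} points")
```

A correlation coefficient is unaffected by adding a constant to one
set of scores, so it registers only the preservation of rank and
spacing, not the absence of change.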

To sum up, when a child is tested with the Stanford-Binet scale 
between the ages of about 4 to 15, these being the ages at which 
the scale gives the most stable results, there is a 50% probability 
that his I.Q. will remain constant within 5 points or so up or
down over an unspecified period of time. This, it will be noticed,
is a very carefully qualified statement, and it opens up a number
of further problems.


TABLE 49


CHANGES IN I.Q.’S FOR VARIOUS TIME INTERVALS


(From Gray and Marsden, tabulation by Nemzek, 1933 b, Table 2, p. 144)


                                Middle 50% of     Average   Central tendency   Interval between
Testings    N     r             differences       change    of change          testings
1 and 2     100   .89 ± .019    -2.25 to 7.66     4.95       2.25
2 and 3      55   .91 ± .016    -3.03 to 3.00     3.01       0.00
1 and 3      63   .84 ± .059    -1.00 to 7.25     4.12       3.50
All         218   .88 ± .036    -2.70 to 7.00     4.85       1.60              1-2
1 and 2     100   .88 ± .019    -2.25 to 7.70     5.00       2.25
1 and 4     371   .85 ± .011                                                   1-3
1 and 6     616   .85 ± .008    -6.10 to 4.70     5.50      -1.30              1-5


B. For one thing, if the middle 50% of changes are likely to
lie within an approximate range of 5 points more or less up or
down,* this leaves 50% of changes that are more extensive. How
large are they likely to be? A considerable number of quite
extensive ones are to be expected. Thus Psyche Cattell (1937), in
connection with the retesting of 1,300 children with the
Stanford-Binet scale, involving 3,331 comparisons in all, reports that 4
individuals, or .3% of the group, made changes of over 40 points,
that 1% of the group gained 30 points or over, that 5% gained
20 points or over, that 10% gained 15 points or over, and that
25% gained 8 points or over. Nemzek (1933 a) and Robert Thorndike
(1940) also report a considerable number of large changes
in I.Q., as will be seen from Tables 50 and 51. Two of Thorndike’s
cases made the sensational gain of 50 points, and one of them
the almost equally sensational loss of 45 points, while gains
reported by Nemzek ranged to 32 points and losses to 22 points.

* The reader will realize that this specific figure is a rough one.
The obtained limits of the middle 50% for various studies have
already been reported.


TABLE 50

I.Q. CHANGES IN 52 CHILDREN RETESTED AFTER 2 YEARS

(Nemzek, 1933 a, Tables 2, 3, p. 476)

                              GAINS        LOSSES
STANFORD-BINET    Range       2-32 pts.    1-21 pts.
                  Median      8.93         6.00
                  Mean        10.82        6.22
                  Number      33           18

HERRING-BINET     Range       1-22         2-19
                  Median      10.63        6.00
                  Mean        10.66        7.72
                  Number      33           18

TABLE 51


CHANGES IN I.Q. AMONG 1,100 PRIVATE SCHOOL CHILDREN AFTER
2½ YEARS


(After Robert Thorndike, 1940)


      INCREASE                 DECREASE
Points       N           Points       N
  50         2
  45         3             45         [illegible]
  40         7             40         [illegible]
  35        11             35         5
  30        15             30         [illegible]
  25        45             25         8
  20        60             20         [illegible]

The changes recently reported by Hirt (q.v.) are less extensive.
Retests with the 1916 Stanford-Binet were run on 1357
pupils at varying but substantial time intervals. Over 46% varied
less than 6 points, almost 75% less than 11 points, less than 10%
more than 15 points. The most striking change was a drop of 38
points, from 68 to 30, between C.A. 8-3 and 17-10, this
accompanying a case history of epilepsy secondary to organic lesion.
More generally, Hildreth (1943) finds that the average retest
scores on the Stanford-Binet of children of 130 I.Q. or more tend
to run higher. And Mildred Allen (1944, 1945) finds that
Kuhlmann-Binet testings in grade 1 are not closely related either to
academic achievement or indices of intelligence in grade 4.

Three comments are in order in regard to these extensive
changes and their frequency. (a) The facts seem to call for some
revision of Terman’s earlier opinion that I.Q. increases of 20
points or more were to be expected only in 1 or 2 cases per 1,000.
Apparently they occur much more often and go far beyond 20
points. However, his claim that half of all changes are likely to
be between about 5 or 6 points advance and 4 or 5 points decrease
still stands. (b) Such changes in no way invalidate the I.Q. as a
unit of measurement, any more than a balance is invalidated
when it records a change of weight. The question of the validity
or, better, the stability of the I.Q. as a unit of measurement
depends on quite other considerations. (c) It must not be supposed
that because large changes are very much less common than small
ones, they are therefore unimportant and to be disregarded. In
science it constantly happens that an unexpected deviation that
only occurs once in thousands of times opens up a whole new area
of investigation and explanation. The relatively high frequency
of small I.Q. changes is, of course, an important datum. But large
changes, although comparatively infrequent, are data just as
important. No account of the constancy problem which disregards
them can possibly be well founded.
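
Comment (a) can be illustrated numerically. The sketch below assumes,
purely for illustration, that I.Q. changes follow a normal curve
centered on the central tendency of +1.7 quoted from Terman, with his
probable error of 4.5 treated as the probable error of the changes.
The normality assumption, that reading of the probable error, and the
function name are assumptions of this sketch and are not taken from
Terman. The conversion used, probable error = 0.6745 × standard
deviation, is the standard one for a normal curve.

```python
from math import erfc, sqrt

# For a normal curve, probable error = 0.6745 * standard deviation.
PROBABLE_ERROR_TO_SD = 1 / 0.6745


def normal_upper_tail(threshold: float, mean: float, sd: float) -> float:
    """P(X >= threshold) for a normal distribution with the given mean and SD."""
    z = (threshold - mean) / sd
    return 0.5 * erfc(z / sqrt(2.0))


central_tendency = 1.7   # from Terman's summary quoted in the text
probable_error = 4.5     # from Terman's summary quoted in the text
sd = probable_error * PROBABLE_ERROR_TO_SD

expected_share = normal_upper_tail(20.0, central_tendency, sd)
observed_share = 0.05    # Cattell (1937): 5% gained 20 points or over

print(f"implied SD of changes: {sd:.2f} points")
print(f"expected share gaining 20+ points: {expected_share:.4f}")
print(f"observed share is about {observed_share / expected_share:.0f} times larger")
```

Under these assumptions gains of 20 points or more would be expected
in only a few cases per thousand, far fewer than the 5% reported by
Cattell. Large changes are evidently more frequent than a simple
normal curve fitted to the middle of the distribution would suggest.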

C. The constancy of the I.Q. is regularly affected by certain
conditions associated with the administration and statistical
character of the tests themselves.

(a) The variability of the I.Q. increases with the increase of
time between testing and retesting. Robert Thorndike (1933), in
an elaborate statistical study on the effect of time interval on
Stanford-Binet I.Q.’s, concludes that a correlation of .889 is to
be expected between quotients based on testings following one
another immediately, a correlation of .814 when the interval is
30 months, and a correlation of .698 when the interval is 60
months. R. Brown (1933 b) finds that a similar but somewhat
smaller drop in constancy is to be expected. The correlation of
I.Q.’s obtained with a time interval of 1 year is .86, and with a
time interval of 9 years is .78. An average change of I.Q. points
over a year is ± 5.36, and over 9 years ± 9.34. Brown (1933 a),
in another study based on Stanford-Binet I.Q.’s of 124 problem
children, finds that mean changes over periods from 60 to 145
months are twice as great as those over periods of 1 to 24 months.

(b) It has been reported from time to time that high I.Q.’s tend
to rise and low ones to fall. Thus Psyche Cattell (1931) established
this trend in the testing with the Stanford-Binet scale of 1,183
children, which was repeated at least twice with each child, at
time intervals up to 72 months. For time intervals of longer than
6 months, which meant the virtual elimination of practice effect,
those in her highest I.Q. category made a mean gain of 16.0 points,
and those in her lowest category made a mean loss of 7.5 points.
Also, Goodenough (1928 c), using the Kuhlmann-Binet scale in a
study of 380 young children, found consistent increases in mean
I.Q. on retesting after 6 weeks. The mean I.Q. of 2-year-olds rose
from 105.1 to 108.1; that of 3-year-olds, from 104.4 to 107.6; that
of 4-year-olds, from 109.4 to 116.0. As she says, this has the
appearance of a practice effect. But the gains were chiefly among
children of the professional and business classes, hardly any
occurring among children of day laborers. Various suggestions have
been made to explain this phenomenon, as, for instance, that bright 
children read and study more, which may or may not be so, and 
that the subtests of the scales become more verbalistic as they 
advance, thus increasingly favoring the abler and hampering the 
less able individuals, which seems much more plausible. 

(c) The intelligence quotients of very young children are likely 
to be decidedly more unstable than those of older children. Some 
of the evidence bearing on this point has already been discussed 
earlier in this book in connection with tests for young children. 
In addition, the two studies by Allen already mentioned, and the 
work of Katz (q.v.) may be cited. She administered the
Stanford-Binet scale to 160 girls and 148 boys between the ages of 3 and 5,
the testings coming at intervals of 6 months. Of this group 40%
varied more than 20 points, and only 20% were in the same
classification on all five testings. Here again the suggested explanations
are many and varied, though they are not on the whole inconsistent
and may each of them contain part of the truth. Early instability
has been attributed to rapid and uneven early mental growth, to
the cumulative effect of the environment during formative years,
and to the unreliability of tests for young children. This last
factor, however, cannot entirely explain changes trending strongly
in one direction, such as those reported in connection with the
effects of preschool attendance.
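
One way to read the correlations cited under (a) is through the
standard error of estimate, SD × sqrt(1 − r²), which gives the typical
error in predicting a retest score from a first score. The sketch
below applies that standard formula to the three correlations
attributed to Thorndike (1933), read here as .889, .814, and .698. The
standard deviation of 16 I.Q. points is an assumption made only for
illustration, and the function name is hypothetical.

```python
from math import sqrt


def standard_error_of_estimate(r: float, sd: float) -> float:
    """Typical error when predicting one score from another correlated r with it."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    return sd * sqrt(1.0 - r * r)


ASSUMED_SD = 16.0  # assumption for illustration; not taken from the text

# Interval in months -> expected correlation, as read from the text.
correlations = {0: 0.889, 30: 0.814, 60: 0.698}

errors = {
    months: standard_error_of_estimate(r, ASSUMED_SD)
    for months, r in correlations.items()
}

for months, se in errors.items():
    print(f"{months:2d} months: standard error of estimate = {se:.1f} points")
```

On these assumptions the error of prediction grows steadily as the
interval lengthens, which is the practical meaning of the falling
correlations.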

D. This final comment opens up a question which may well
have been already in the reader’s mind. What of changes in the
I.Q., either positive or negative, associated with some special set
of circumstances such as those considered in the last chapter:
favored or unfavored socioeconomic status, special educational
stimulation, continuation in school, good or bad home conditions,
and the like? The point to understand is this: All such special
changes are significant only as deviations beyond the range of
expected constancy. If the intelligence quotient were not constant
within the limits and with the qualifications already set forth,
fluctuations and changes associated with this or that particular
condition would be in no way striking and would suggest no
problems and call for no explanations.

E. The general conclusion, then, must be that the massive data
in regard to the intelligence quotient indicate a considerable but
by no means absolute constancy of mentality. They indicate the
limits within which changes are likely to take place. And when
greater changes seem to occur, a problem immediately arises.


3. Other psychometric evidence 
Outside of the studies on the constancy of the I.Q., the
psychometric evidence on the constancy of mental traits and abilities is
rather scanty and not very precise. However, it is without doubt
confirmatory. Thus Hollingworth and Kaunitz (q.v.) found that
82% of a group of 116 children in the top centile in I.Q. of all
those tested on the Stanford-Binet scale were in the top centile
10 years later when measured with the I.E.R. Intelligence
Scale CAVD and Army Intelligence Examination Alpha. Lamson
(1930), too, has strikingly shown the maintenance of intellectual
status. She studied 56 gifted pupils who were in special
opportunity classes in elementary school, following them through high
school, and comparing each child with a paired control from the
same elementary school and of the same sex and grade
classification. In regard to the determination of their intelligence, they
were tested three times with the Stanford-Binet scale in
elementary school, and twice with Army Alpha in high school. The mean
I.Q. of the gifted group was 155 and the range was from 135 to
190. The first testing with Army Alpha was at a mean
chronological age of 10 years 11 months. At this age 33% of them scored
in the “A” classification, which as may be seen from Table 7
means a test performance equivalent to that of the top 5.14% of
the white draft in World War I. The second Alpha testing was
at the age of 15, and on this occasion all of them scored in the
“A” category. Just how significant this is may be gathered from
Table 52. No other group there listed scores 100% A. The
comparison between these children and the group of graduate students
is particularly striking. This study of Lamson’s is a typical
follow-up investigation, and other work of the same kind confirms
the finding that mental status tends to be retained.


TABLE 52


PERFORMANCE OF GIFTED HIGH SCHOOL GROUP ON ARMY ALPHA
COMPARED TO THAT OF OTHER GROUPS


(Lamson, 1930, Table 5)


                                  Percent
Group measured                    getting “A”      N      Mean C.A.
                                  Alpha scores

Hollywood High School Seniors       34.1          211
High School Seniors                 37.7          635      17.4
University Students                 51.5         5950
Library Personnel                   60.0          296
Oberlin College Freshmen            70.0          330
Graduate Students                   81.0          252
Hotchkiss Seniors                   81.0           75
Yale Freshmen                       85.5          400
Gifted Group                       100.0           54      15.0


Beyond the measurement of intelligence itself, little psycho- 
metric work seems to have been done on the constancy problem. 
About the only significant investigations relevant to it are those 
dealing with interest, to which reference has already been made. 
As will be recalled, interest patterns are found to be fluctuating 
and uncertain in childhood, but to become stabler later on in life. 
And although they do change with the years, these alterations
seem usually to amount to stronger feelings of liking for what is
earlier liked very well, and decreasing inclination for what is not
liked very well in earlier years.


4. Evidence from beyond psychometrics 

So far as the present writer’s knowledge goes, the general prob- 
lem of the constancy of mental traits and abilities has not received 
much attention in psychology outside the field of psychometrics. 
There are at least three apparent reasons which would make this 
understandable. First, it is natural and obvious to assume that 
the general mental make-up of any individual will remain about 
the same throughout his life. This corresponds to ordinary experi- 
ence, and the assumption has been accepted without elaborate 
investigative proof. Second, in most fields of psychological
research the problem of constancy is not paramount. It only becomes
so when we are trying to construct measuring instruments and to
devise criteria which can be applied to an individual at various
ages. Third, radical and striking changes in the personality and
mentality of human beings have not forced themselves upon the
attention of psychologists, presumably because they do not often 
occur. If they were common, it is safe to say that they would 
have been made the subject of investigation long before today. 

Of course, the whole psychology of learning bears upon the
subject. It is known that (a) human beings can certainly learn a
great many more things than they do learn, (b) the effects of
learning can be lasting rather than superficial and transient, (c)
the capacity to acquire new abilities continues until late in life.
Human beings, in other words, are highly adaptive, and the actual
extent of any individual’s adaptation depends on circumstances 
and is never fully actualized. Thus mentality is very far from 
fixed or absolutely constant. No doubt, however, learning capacity 
and adaptability have definite limits, though just what they are 
and what determines them is not well understood. 


5. Conclusion

All the evidence on this problem hangs together quite
consistently. The findings on the constancy of the intelligence quotient
are quite in line with other psychometric evidence and with the
general trend of psychological thought and investigation. Those
findings have dealt with the problem in specific quantitative terms.
They indicate that in the majority of cases, and under ordinary
circumstances, the mentality of human beings may vary somewhat
but not a great deal, that variability depends upon the length
of elapsed time, the nature of the tests used, and the age at which 
mental status is determined, but that in a significant and not 
inconsiderable number of cases large though not unlimited varia- 
tions can take place. In all probability this corresponds reasonably 
well to the true facts. It in no way prejudges the issue of whether 
mentality can be changed by properly chosen influences, favorable 
or unfavorable, consciously brought to bear. Again, in all proba- 
bility this can be done, in view of our total knowledge of the 
psychology of learning and of the phenomena of mental growth, 
which is the next major topic to consider here. But once more, 
the possibilities of such change are far from limitless. 

This seems to be a reasonable position, in the light of all the 
evidence. It in no way undermines the foundations of psycho- 
metrics, for it provides a quite sufficient basis for the construction 
of significant mental tests. Those tests do not measure phenomena 
that are invariant or absolutely fixed, or indeed nearly so. But 
such a supposition would surely be intrinsically untenable, con- 
sidering that we are dealing with the living, changing, adaptive 
human being. However, human nature is by no means so fluctuat- 
ing and uncertain as to render all attempts at measurement and 
prediction futile. 


MENTAL GROWTH 


The topic of mental growth is in a sense complementary to 
that of constancy, for it has to do with the sequential changes 
that take place in a person’s mentality and behavior patterns 
during the course of his life. Those changes are due partly to 
organic maturation, and partly to experience and learning. The 
specific relevance of this topic to psychometrics lies in the fact 
that many tests and measures offer interpretations based upon 
the relationship between chronological age and mental ability. Its 
broader relevance turns upon the question of whether and to 
what extent psychometric instruments can give an account of the 
changes in the way of mental development and decline within the 
constant framework of the person’s individuality. The subject 
itself is, of course, a very large one, to which a very great number 
of investigations have been devoted. Only those aspects of it
which are directly related to psychometric problems will be
summarized here.


1. Early mental growth

Early mental growth is characterized by the rapid emergence 
of new functions and differentiations. Gesell and his associates 
(v. Gesell and Amatruda), to whose work reference has already 
been made, have carried out elaborate studies that reveal some 
of these changes. Thus an infant at the age of 1 month lifts his
head from time to time when held at the experimenter’s shoulder.
At 8 months he sits momentarily without support. And at 21
months he walks attended on the street. With regard to language,
at 1 month he gives definite heed to sound. At 8 months he gives
vocal expression to recognition. At 21 months he repeats things
that are said. With regard to what Gesell calls adaptive behavior,
at 1 month he stares at a massive object presented in his field of
vision. At 8 months he definitely looks for an object fallen on the
floor. At 21 months he differentiates between a toy tower and a
toy bridge. These are a few sample items taken from Gesell’s
developmental schedules, and from them the reader can gain some
impression of the complex and varied differentiations and
integrations referred to by the term mental growth.

Such studies are of great importance. They support and amplify
our general notions as to the nature of the growth process, and
they reveal its specific phenomena. But they also make it clear
that the growth concept is one that involves very serious
psychometric difficulties.

A. In the first place, the term growth evidently includes the
evolution of more or less separable functions which have their
own developmental rhythms. Thus it is necessary to distinguish
between motor development and linguistic development. As Bayley
(1933) and others have shown, motor development proceeds more
rapidly and decisively during the early months of life than
linguistic development. Also, it is necessary to consider as a more
or less separate category what Gesell calls adaptive behavior.

But this is by no means all. Each of these broad divisions
probably contains within itself an indeterminate number of
subdivisions that are just as real and important. Is it, for instance,
legitimate to assume that the child’s linguistic reactions as
described by Gesell at the age of 1 month are psychologically
continuous and homogeneous with those at 21 months? What is
lumped together as language is really a very complex constellation
of functions. Thus Lewis (q.v.) in his study of infant speech 
shows that a few hours after birth comfort and discomfort sounds 
can be discriminated. At 2 to 3 months there appear babblings, 
i.e., comfort sounds pleasant to make for their own sake, which 
according to the interesting suggestion of Lewis may be the origin 
of aesthetic interest in speech and sound. Between 1 and 4 months 
there is much rough imitation of adult speech sounds, but after 
4 months imitation seems to become much rarer, because meaning 
begins to be prepotent. After 6 months, however, imitation re- 
appears accompanied by echolalia. Then there is a rapid accumu- 
lation of phonetic forms and new concepts. Stumpf (q.v.), again, 
finds that early vocalization is neither speech nor song, but the 
matrix of both, speech moving in the direction of increasing con- 
trol by symbolic meaning, and song in the direction of increasing 
control by pitch. So, too, Gesell’s adaptive behavior seems hardly 
more than a topical category covering many real differences. 

A good diagnostic and predictive account of a child’s early 
growth will give great weight to the relationships manifested in 
the development of these discriminable functions. General retarda- 
tion is an unfavorable sign. The precedence of speech to walking 
is usually a very favorable one, and so on (v. Gesell, 1940). But 
it becomes very much of a question whether a global over-all 
index such as mental age can ever represent the significant phe- 
nomena of early growth. Such a measure or index can only be an 
average, concealing and containing within itself many important 
differences. This is a major problem in the construction and inter- 
pretation of mental tests for young children; and, as we have 
seen, criticism of the use of global indices extends also to their 
use far beyond infant levels. 

B. It is usually believed that mental growth in the early months 
and years of life is very rapid. Certainly this seems true to anyone 
who watches a growing infant. And it is compatible with the 
general phenomena of organic growth. At any rate, the assumption 
is often made. Thus Terman and his associates often remark that 
a year of mental age means a much greater real change at an early 
age level than at a later one. But there are no psychometric means 
for telling just how rapid early growth really is, compared to later 
growth. Indeed, some growth curves have been developed, as by 
Gesell (1929) and Bayley (1933), which show positive
acceleration, i.e., a slow initial advance that becomes faster. But
they were plotted in such a way as to make this inevitable. Rapid
early growth is probably the reality, but exact quantification is
lacking.

C. It has often been claimed that early growth is peculiarly 
subject to environmental influence. Again, this is quite probably
true in some important sense, but we do not know with any
exactitude in what sense. Older persons can certainly learn and
change. But perhaps the changes cannot be as profound in some
way or other as those that can take place early in life.
Psychoanalysis assumes the prepotency of early influences upon
development, and no doubt with justice. The argument has been used to
explain the remarkable effects claimed for preschool attendance,
and it is not implausible. But we have no conclusive proof of its
truth and no quantitative definition of its meaning.


2. The continuation and culmination of mental growth 


Two problems arise here. The first is the significance of the
so-called age of arrest. The second is the nature and reality of
mental growth continuing into the later adult years.

A. The age of arrest is primarily a psychometric rather than
a general developmental concept. It is important that this should
be clearly understood. It means the chronological level at which
the regular mean improvement of test scores with increasing age
ceases to manifest itself. The issue is an old one. Terman (1917),
with his standardization group for the upper levels of the Stanford
Revision of the Binet scale, which consisted of “normal adults,”
i.e., businessmen and older high school pupils, found that this took
place at about 16 years. So this was recommended as the age of
arrest, to be used as the denominator in computing the intelligence
quotient. Yerkes (1921) found that a [...] group [...] it was
suggested, should be computed on [...] which of course would raise
them very materially. Terman himself [...] saying that the Army
tests were administered to bewildered and disoriented men, and so
forth.

The problem has been found very challenging, and numerous
investigators have devoted attention to it. Thus Thorndike (1923,
1926) gave tests at yearly intervals to large groups of subjects
from 13 to 19 years old, and found advancing scores up to the 
latter age. Teagarden (1924), again, gave four intelligence tests to 
408 subjects from 12 to 20 years old, and found a steady advance 
in scores up to 18, with no reason to suppose that for normal 
individuals mental advance was arrested even then. 

A good deal has been made of such disagreements, evidently 
in the belief that some very far-reaching issue was at stake. But, 
to repeat, the age of arrest is primarily a psychometric concept. 
It does not mean that there can be no mental development beyond 
14, 16, 18, or 19 years, which would be preposterous. To interpret 
it as meaning that organic or hereditary mental development, or 
maturation, ceases at one or other of these ages is gratuitous and 
merely a theoretical construction. In strictness, all that it means 
is that when a certain battery of tests is used, the parallel increase 
of mean scores with increased age ceases at a given point. This is, 
of course, of great importance for all age scales, since indices 
depending on a constant relationship between age and test per- 
formance must be re-interpreted above the point where that rela- 
tionship no longer holds. Also, it raises a question as to content, 
for there is the possibility that the battery is heavily loaded with 
functions, such as immediate memory, which have rather definite 
age ceilings. But in essence the problem is technological rather 
than general. 

B. This leads directly to another and broader problem, namely, 
the nature of mental development during the adult years. Regard- 
ing this there are some rather definite findings. 

When a test emphasizes speed and is loaded with highly manip- 
ulative and routine items which adults find trivial and annoying, 
it is apt to yield a picture of early arrest and even of actual decline 
during the adult years. This is what was shown in a study by 
Miles and Miles (q.v.). They gave a special form of the Otis Self- 
Administering Test of Mental Ability to 823 persons in age classi- 
fications up to 94. The high point of the mean scores was at 18, 
followed by a rapid decline, and at 50 the averages were back to 
the levels of the early teens. In the light of further work the 
explanation clearly is that the battery dealt with functions which 
show this type of developmental sequence. 

Attempts have been made with some success to show what 
mental functions are affected by age. Thus Jones and Conrad 
(q.v.) gave Army Group Intelligence Examination Alpha to 1,191
residents of New England villages, ranging in age from 10 to 60.
Average scores based on all 10 subtests showed a peak at 19, and 
then a smooth drop. But the effect of the subtests was differential. 
The Common Sense and Analogies subtests showed the sharpest 
drop. Arithmetical Problems, Number Series Completion, and 
Scrambled Sentences showed some. Opposites and Information 
showed none at all. 

Willoughby (q.v.) obtained mean scores on 11 tests for age
groups up to 60. All the averages showed a rise into the late teens
and early twenties. Then there was a decline for those involving
Series Completion, Verbal Analogies, Verbal Opposites,
Substitution Learning (digit-symbol), and Information in History and
Literature. Arithmetical Reasoning, however, showed no decline.

Sorenson (q.v.) gave tests in vocabulary and paragraph reading 
to between five and six thousand adults. Of this total population 
he selected 641 to provide uniform groupings at five-year inter- 
vals from 16 to 70. The members of each of these five-year groups 
were selected to correspond to the group 50-54 years of age in
years of schooling and occupational status. The great importance
of holding these two factors constant in dealing with adults, if
comparisons of test performance are to be meaningful, is very
evident. If, for instance, our adult groups below 30 are strikingly
superior in education and socioeconomic status to those above 40,
an apparent decline of intelligence is extremely probable. With
these two factors held constant, Sorenson found that mean
vocabulary scores improved throughout the age range, and that
paragraph reading was maintained at an even level. He argues that
alleged and demonstrated declines in mean test scores at later
levels are largely due to subjects getting more and more out of
practice with test materials as they grow older.

Weisenburg, Roe, and McBride (q.v.), too, have made an
elaborate study of the problem, some of the details of which are
presented in Table 53. Their work shows very clearly that tests
differ in suitability for adults, sentence completion apparently
being a good one, the later Stanford-Binet subtests standing up
well, vocabulary also being a good one. Performance tests, on the
other hand, are generally poor adult material. So is the
Goodenough Drawing Scale. On the whole, according to their results,
language ability is well maintained with age, but other functions
tend to show a peak and a decline.

Christian and Paterson (q.v.) gave a vocabulary test of 120
items to a group of university freshmen and to another group
made up of their relatives and friends. The differences, which are
shown in Table 54, slightly in favor of the younger group, are
not statistically significant. The authors also point out that the
younger group was somewhat more highly selected for
intelligence. Furthermore, the test involved a speed factor, and when
this was cancelled out the slight advantage of the student group
disappeared. As a piece of research this is somewhat slight. But
in conjunction with much confirmatory evidence its emphasis upon
the vocabulary test as suitable for adults is significant and
indicative.


TABLE 54
RECOGNITION VOCABULARY SCORES OF VARIOUS AGE GROUPS

(Christian and Paterson, p. 168)

Age Group            N    Median   S.D.   Range

18 year olds        200     88      17    41-115
40-49 year olds      50     84      22    40-116
50-59 year olds      40     82      26    25-115
60-69 year olds      30     79       2    10-114


W. R. Miles (q.v.) presents a general report on the Stanford 
Later Maturity Study. The work in 1930 was based on 863 persons 
aged from 6 to 95. The work in 1932 was based on 1,600 cases. Age
groupings as follows were set up: B, 10-17; C, 18-29; D, 30-49;
E, 50-69; F, 70-89. The averages show definite general declines
for the older age groups. On a maze test the scores for the five
groups were respectively 95, 100, 92, 83, 55. When a single rather
experimental test intended to measure imaginative capacity was
given, there was virtually no change in the mean scores for the
various groups. Tests requiring the higher types of intellectual
effort show in general a late maturity and a slow decline. Thus on
a learning test the mean scores for the five groups were
respectively 72, 100, 100, 87, 69. Comparison and abstraction as
measured by the ordinary types of test items show a peak at about 18,
and then a decline. But according to Miles the function itself is
inadequately measured, which is extremely probable. All the
averages cited need to be interpreted in the light of the fact that the
dispersions were large at all age levels, and that there was much 
overlapping. To give an idea of how extensive this was, 25% of 
the oldest group equaled or excelled the over-all adult average, 
even with speed as a factor. 


Figure 27 reproduces the curve of mental decline developed by
Wechsler (1944), which is probably at least approximately true
to the mean trend. There are many individual variations,
however, and much specialization in the changes of different functions.
And as Wechsler himself remarks, although mental decline is
a basic phenomenon of senescence, experience can serve for
a long time to render an older person efficient to a superior
degree.

FIG. 27. CURVE OF MENTAL DECLINE (TEST PERFORMANCE,
WECHSLER-BELLEVUE SUBTESTS)
(Wechsler, 1943, p. 56)

So, to summarize, routine functions and school-like functions 
show an early peak and a rapid decline. There is undoubtedly a
memory loss in old age (Gilbert, 1941). But reasoning ability and
language functions may advance for many years and show only
a very late decline (Gilbert, 1935). Moreover, it is clear that for
any stable and dependable conclusions, the selection of older
persons is of paramount importance, for cumulative differences in
occupation, mode of life, general stimulation, and alertness can
easily falsify a set of test findings. Moreover, too, in dealing with
adult mentality, motivation is of great importance. It has been
shown that among eminent persons the greatest creative output in
literature and science is between the ages of 25 and 40. But as
Lehman (q.v.), on whose work the above statement is based,
points out, these are the years when such persons are establishing
their reputations and working hard to build a career. The
inference for psychometrics simply is that a test situation which,
because of its setting and content, is found ridiculous, or childish, or
trivial by adult subjects, is almost certain greatly to misrepresent
the truth as to their mental abilities.

It is evident that this material is full of meaning for
psychometrics. It is entirely possible to construct effective adult
mental tests. They must consist of suitable content. But what
such content needs to be is fairly clear. And such tests must not
yield indices depending on very fine and small age classifications.
On the other hand, all that has been said makes it clear why tests
constructed and standardized primarily for young people in school
may well show an early age of arrest, and why they tend to suggest
a false picture of mental growth beyond the late teens.


3. The growth curve 


For psychometrics the most fundamental question in this whole
area is whether it is possible to construct a curve representing the
true average or “normal” course of mental growth over a
considerable age range. If this could be done, it would be of the
highest importance. Test performance could be interpreted by
indices and scores that would show known increments and levels
of mental growth, instead of by mental age scores or standard
scores whose developmental values are uncertain. It would be
possible to tell whether a person’s mental development was proceeding
more or less rapidly than normal, and whether he had reached a
stage of maturity at, or above, or below the expected norm for his
age. This very thing, as we have seen, has been attempted by
Kuhlmann, who accepted the curve developed by Heinis (q.v.)
as an authentic chart of normal mental development.

FIG. 28. THREE GROWTH CURVES FOR THE SAME TEST (NATIONAL
INTELLIGENCE TEST). A. on raw scores; B. on S.D. scores; C. on S.D.
scores from absolute zero. Freeman (1929), Fig. 10; Odom, Fig. 6;
Thurstone (1928), Fig. 8.

It is easy enough to run a test or a battery of tests at a number
of different ages, and then to plot a curve through the mean scores
obtained for those ages. But to accept such a curve as a true
representation of mental growth is quite another matter. Three
curves so derived are shown in Figure 28. They are all for the
National Intelligence Test, and they are all different. The chief
reason is that they are all three based on different statistical
treatments of the test data. Freeman’s curve shows changes in
average raw scores (v. Freeman, 1921). His procedure in this
respect has been criticized by Odom (q.v.) and J. Peterson (1922).
The objection is that equal differences in raw scores may not
correspond to equal differences in performance. Thus it may be
much easier to raise a score from 50 to 60 than from 120 to 130,
and so on.

In order to meet this difficulty, Odom took the mean score of
the lowest age group as the point of origin for his curve, and the
standard deviation of this group as the unit of measurement. The 
raw scores of all other age groups were converted into this unit 
by the method already explained, the basic idea being that the 
increment of performance or difficulty expressed by any given 
difference between raw scores would be considered equal to that 
registered by the lowest age group. Then averages for all age 
groups were worked out in terms of these derived scores, and the 
curve plotted as shown. 
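
The conversion just described can be sketched in a few lines of Python. This is only a simplified reading of the procedure, not Odom's actual computation: the function name `odom_units`, the use of the sample standard deviation, and the raw scores are all assumptions invented for illustration.

```python
from statistics import mean, stdev

def odom_units(scores_by_age):
    """Express each age group's mean score in units of the lowest
    age group's S.D., measured from that group's mean (the origin)."""
    lowest_age = min(scores_by_age)
    origin = mean(scores_by_age[lowest_age])   # point of origin of the curve
    unit = stdev(scores_by_age[lowest_age])    # unit of measurement (sample S.D.)
    return {
        age: mean((s - origin) / unit for s in scores)
        for age, scores in sorted(scores_by_age.items())
    }

# Hypothetical raw scores, not taken from any real test.
raw = {
    8: [40, 50, 60],
    9: [55, 65, 75],
    10: [70, 80, 90],
}
curve = odom_units(raw)
for age, value in curve.items():
    print(age, round(value, 2))
```

By construction the lowest age group falls at zero on such a scale, so the curve's origin is a matter of convention rather than a true zero of ability.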

Thurstone (1928) introduced a distinctive feature into his treat- 
ment of the test results. He contended that an authentic curve of 
mental growth must start from a true zero point as its origin. 
This would be a point where there is no intelligence or mentality 
at all. But no test shows this true zero, for a person might very 
well score zero on it and still, in fact, have considerable intelli- 
gence. Thus a theoretical construction became necessary. This 
turns on the fact that when a test is run on groups of varying ages, 
the variability or spread of the scores shows a regular increase 
with increasing age. The relationship between the increasing varia- 
bility of the scores and the increasing age of the subjects is not 
perfectly uniform, but one can assume that it would be under 
ideal conditions including perfect test reliability. Thurstone, then, 
assumed that the failure of an absolutely uniform relationship 
between variability and age to appear in the actual results came 
from errors due to known causes, for which consequently allow- 
ance could be made. This he proceeded to do, and was then able 
to express the constant or uniform age-variability relation as a 
mathematical equation. When this was accomplished, he could 
readily determine at what age variability would become zero. And 
since variability increases as age increases, and decreases as age 
decreases, this zero point would be the true zero of intelligence as 
measured by the test. So the Thurstone curve, which is also plotted 
in terms of averages of derived scores supposed to represent equal 
units of test performance, differs most importantly from the Odom 
curve in having a differently determined point of origin. 
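
The extrapolation to a zero point can be illustrated with a small Python sketch. It assumes, purely for illustration, that the idealized relation between variability and age is a straight line fitted by least squares; the text says only that the relation was expressed as a mathematical equation, so the linear form, the function names, and the figures below are all hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

def age_of_zero_variability(ages, sds):
    """Age at which the fitted S.D. would fall to zero."""
    slope, intercept = fit_line(ages, sds)
    return -intercept / slope

# Hypothetical S.D.s of test scores at four ages (invented data).
ages = [8, 10, 12, 14]
sds = [9.0, 11.0, 13.0, 15.0]
zero_age = age_of_zero_variability(ages, sds)
print(zero_age)
```

The sketch makes the dependence on assumptions plain: a different assumed form for the age-variability relation would move the extrapolated zero point, and with it the whole curve.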

Which of the three curves, all based on the same test, and all
different, authentically represents the course of mental
development? This, clearly, is the question. One further example will
make it even more cogent. The ordinary assumption, apparently
endorsed by Terman and his associates, is that the Stanford-Binet
scale shows a negatively accelerated pattern of mean performance.
That is, it shows a development that is at first rapid and then
slows down, so the earlier mental age unit steps are really further
apart than the later ones. But Thurstone and Ackerson (q.v.),
applying to it the method described above, derived a curve
showing positive acceleration up to the age of 10, i.e., showing a
development which begins slowly and then speeds up. So, again,
which picture is correct?

The answer is that we cannot say. Statistical ambiguities make this
inevitable. To plot a curve on raw test scores as Freeman did is
certainly open to objection, because equal differences between raw
scores almost certainly do not stand for equal differences in test
performance. Moreover, this method determines no zero point for
an origin. But the rectifications by Odom and Thurstone, while
they may be statistically justifiable, involve so many formal
assumptions that the psychological meaning of the outcome is
impossible to determine. For this reason the position taken by
Terman seems entirely justifiable when he says that nothing
certain is known about the curve of mental growth, and that it cannot
be used as a psychometric tool.

These difficulties are what might be called formal or logical. 
But there are psychological difficulties too, which are even more 
important. On all the evidence, mental growth is anything but a 
simple unitary process. On the contrary, as we have seen, it is
complex, diversified, composed of shifting and interlocking rhythms,
characterized by the sudden emergence of new functions and the
long delay of others. A very sound argument can be made for
denying that it is a linear process at all. If this is so, it cannot be
represented by any single curve, no matter what devices of
statistical analysis are used. Indeed, the consequence would be that
the better analysis became, the further it would move from the
sort of simple representation shown above.


4. Conclusion 

This contention is fraught with significance for psychometrics. 
It means that scores such as the I.Q., the M.A., or standard scores
are simply convenient statistics, meaningful because they work
and because they possess an indubitable psychological content.
But they are not adequate indices of a unitary mental development,
and cannot be converted into such indices. Also, this points
the way for the future development of mental testing. The
suggestion is that it should move towards greater differentiation, and
the construction of measuring instruments of more diversified
kinds and for more diversified purposes, capable of dealing better
with the immense variety of human mentality and development.
The attempt to build tests and derive norms about a unitary
linear sequence of mental development seems foredoomed, because
in all probability the hypothesized phenomenon does not exist.


THE DISTRIBUTION OF MENTAL TRAITS


Mental traits are usually assumed to be normally distributed 
in unselected populations unless there is specific evidence to the 
contrary. The importance of this assumption lies in the general 
use of the normal curve as a psychometric instrument. One of its 
typical uses may be explained by taking an hypothetical case. 

Consider three scores of 50, 75, and 100 on Army Alpha. From
these three raw scores alone nothing is known except their rank
order. Whether they are high, medium, or low in terms of test
performance, and whether 75 is as far above 50 as 100 is above
75 in terms of test performance are not known. But if the mean
and the standard deviation of the standardization group are
computed, and turn out to be 75 and 25 respectively, then 75 is average
test performance, and the three scores become -1, 0, and +1 in
S.D. values about the mean. They fall at the indicated points in
Figure 29A.
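
The arithmetic of this conversion is simple enough to show directly. The short Python sketch below uses the figures of the hypothetical case (mean 75, S.D. 25); the function name `sd_score` is merely an illustrative label.

```python
def sd_score(raw, mean, sd):
    """Distance of a raw score from the mean, in S.D. units."""
    return (raw - mean) / sd

# Figures from the hypothetical Army Alpha case in the text.
scores = [50, 75, 100]
z = [sd_score(s, mean=75, sd=25) for s in scores]
print(z)
```

The conversion fixes each score's position relative to the standardization group. Whether equal S.D. steps mean equal steps in difficulty still depends on the form of the distribution.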

The two differences, between 50 and 75, and 75 and 100 in
raw scores, or between -1 and 0, and 0 and +1 in S.D.’s, are
equal if the distribution of the scores of the standardization group
is normal, as in Figure 29A. In this case there will be as many
scores intervening between -1 and 0 as between 0 and +1.
Putting it otherwise, the score of 100 shows as much increase in
difficulty beyond 75 as 75 shows beyond 50, in terms of the
performance of the standardization group. But if the distribution is not
normal, as in Figure 29B, this will not be so. Here considerably
more scores intervene between +1 and 0 than between 0 and -1.
So the two differences do not show equal difficulty.

If the form of the distribution is known, this inequality,
although inconvenient, is not important. The reason is that its extent
is known, and can be allowed for, just as one can allow for the
unequal units of a Mercator projection map. With the
standardization group itself the form of the distribution can be determined,
for we have the raw scores before us for inspection. In most test
construction, standardization groups yielding approximately
normal distributions are chosen. But this group is used as a sample
of the much larger population to which the test is intended to
apply. The real question is whether the two raw score differences
indicate equal real differences in test performance or difficulty for
the entire population and not for the sample only. But here the
form of distribution of scores can never be directly ascertained
unless the whole population takes the test, which is impossible. So
it is necessary to make an assumption, this being that the
distribution is normal. Before considering how this assumption is justified,
three comments must be made.

A. The normal curve is not any bell-shaped curve. It is the
curve of the equation

\[ y = \frac{N}{\sigma\sqrt{2\pi}} \, e^{-x^{2}/(2\sigma^{2})} \]

where x is the abscissa, y the ordinate, π and e are constants,
σ the S.D. of the distribution, and N the number of cases. Glib
talk about the normal curve, as if it could be identified at a
glance, is quite fallacious.
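
To make the definition concrete, the ordinate of the normal curve for N cases and S.D. σ can be computed as in the Python sketch below. It assumes that x is measured as a deviation from the mean; the function name and the sample figures (1,000 cases, S.D. 25) are invented for illustration.

```python
import math

def normal_ordinate(x, sigma, n_cases):
    """Height of the normal curve at deviation x from the mean,
    for a distribution of n_cases with standard deviation sigma."""
    return (n_cases / (sigma * math.sqrt(2 * math.pi))
            * math.exp(-x ** 2 / (2 * sigma ** 2)))

# Hypothetical distribution: 1,000 cases, S.D. of 25 score points.
peak = normal_ordinate(0, sigma=25, n_cases=1000)
at_one_sd = normal_ordinate(25, sigma=25, n_cases=1000)
print(round(peak, 3), round(at_one_sd / peak, 3))
```

A curve can be bell-shaped and symmetrical without satisfying this equation. Normality is tested by comparing observed frequencies with ordinates or areas computed in this way, not by inspection.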

B. There is no universal law of nature that events must dis- 
tribute themselves or tend to distribute themselves in a normal 
curve. As Walker (q.v., p. 168) put it: “A variable which is the 
resultant of several equally potent causes, each as likely to be 
present as absent in any given instances, has a normal distribu- 
tion.” If, for example, one tosses a handful of unbiased pennies, 
the coins are independent of each other, and each is equally likely 
to fall heads or tails, so the number of heads is subject to the 
probability distribution. Random errors in measurement, as in 
the physical sciences such as astronomy, appear to be normally 
distributed. The distribution of adult human stature is probably 
though not certainly normal. The distribution of adult human 
weight is not, because it is subject to deliberate nutritional influ- 
ences which disturb pure chance. The distribution of achievement 
in some subject in school classes is certainly not normal, because 
it is the deliberate purpose of the teacher to introduce influences 
extraneous to chance. 

C. The whole value of the normal curve turns on its
mathematical characteristics. When pennies are tossed, it is possible to
calculate the probability of 100, or 80, or 50 heads out of a
possible 100 appearing. When measurements are made in an
observatory, it is possible to calculate the probability that an error of a
certain magnitude will occur. In both cases this is because the
phenomena are subject to the probability distribution. So, in the
hypothetical instance from mental testing that was discussed, we
can say exactly what the probability is of a person making a
score of 50 or 100 if we know that these scores are 1 S.D. below
and 1 S.D. above the mean, and if we also know that the
distribution of the test ability in the functional population is normal. This
is the assumption constantly made in working out interpretive 
norms for mental tests. Its basic importance for psychometrics is 
obvious. What, then, is the foundation for it? 
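
The two calculations mentioned in paragraph C can be sketched in a few lines of Python. This is an illustration only, not part of the original text: the function names are invented here, the pennies are assumed unbiased and independent, and the test scores are assumed to be exactly normal, which is the very assumption under examination.

```python
from math import comb, erf, sqrt


def prob_heads(k: int, n: int = 100, p: float = 0.5) -> float:
    """Binomial probability of exactly k heads in n independent tosses."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def normal_cdf(z: float) -> float:
    """Proportion of a normal distribution lying below z S.D. units."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


# Probability of exactly 100, 80, or 50 heads out of a possible 100.
for k in (100, 80, 50):
    print(f"P(exactly {k} heads in 100 tosses) = {prob_heads(k):.3e}")

# Proportion of persons expected to score between 1 S.D. below and
# 1 S.D. above the mean, IF the distribution is normal (an assumption).
within_one_sd = normal_cdf(1.0) - normal_cdf(-1.0)
print(f"Proportion within 1 S.D. of the mean = {within_one_sd:.4f}")
```

For a continuous normal distribution the useful quantity is the proportion of cases falling above, below, or between given scores, rather than the probability of one exact score. Everything printed here stands or falls with the normality assumption.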


1. Reasons for and against the assumption of normality

The assumption that mental traits are normally distributed
rests upon what is essentially a circumstantial argument, for it
cannot be directly verified. That argument has been formulated
with great clarity and explicitness by Thorndike (v. Thorndike
and Others). It will first be summarized, and then objections to
it will be considered.

A. The argument runs as follows.

(a) Scores on many tests given to large populations distribute
themselves in bell-shaped symmetrical curves which, although not
precisely normal, are nearly so. The approximation to normality
is close enough so that the values computed from these
distributions will not contain serious errors. A sample of such
distributions is shown in Figure 30.

(b) If the raw scores on these tests are converted into scores
based on the standard deviations in order to equalize the units, 
the approximation to normality becomes still greater. Such distri- 
butions of converted or transformed scores are shown in Figure 31. 

(c) It might be said of some given test that it was constructed 
and its items arranged so as to produce an approximately normal 
distribution. If so, the appearance of such a distribution would 
prove nothing about the distribution of the ability it purports to 
measure. It would be artificially produced by manipulation. But 
Thorndike finds such distributions in the scores not of one but of 
many tests, and not at one but at two age levels, specifically the
6th and 9th grades. Furthermore, composite scores on all these
tests together can be obtained after the units have been equalized.
These composites also show an approximately normal distribution,
and they show it at three age levels, for the 6th grade, the
9th grade, and college freshmen.


(d) It is on these facts that Thorndike rests his case. Whatever
the arguments pro and con, approximately normal distributions
of raw and derived scores do persistently appear for many
different tests and on many age levels.


B. Various criticisms of the logic of this argument have been 
made by Thomas (q.v.) and by McNemar (1942). Their statements
represent a considerable body of opinion.

(a) As to the use of raw scores, it is pointed out that they may 
not constitute truly equal units. If this is so, then the resulting 
distribution will be distorted, just as a Mercator projection map 
of the world is distorted. Thorndike recognizes this possibility, 
but his reply is that normal distributions do in fact result, and 
in many varied situations. In other words, he answers a theoretical 
objection with an actual finding. 

(b) It is said that when raw scores are transformed into stand- 
ard deviation scores, they are forced into a normal distribution. 
If this were so, then the appearance of normality in Thorndike’s 
derived score distributions on single tests, and also in his
composites, would be spurious, and produced by statistical
manipulation. But it is not correct. Transformation into standard
deviation scores does not “normalize” a distribution. This can be
done, but another procedure is necessary, which Thorndike does not
adopt. He transforms his raw scores into standard deviation scores
in order to equalize his units, the assumption being that the
distribution is normal. But that assumption is based on the observed
fact that the distribution of the original raw scores is in fact
approximately normal.

Thus the argument may be circumstantial, but it is not fallacious.
Thorndike has been charged by Thomas and McNemar with stacking the
deck by first forcing a normal distribution on his data through
conversion into standard deviation units, and then discovering a
normal distribution. This is simply not true. And to the difficulty
that the raw scores may represent unequal values, which is possibly
so, he counters with ascertained facts whose impressive persistence
cannot be gainsaid.
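
The point that conversion into standard deviation scores does not “normalize” a distribution can be checked numerically. The following Python sketch is an illustration only, not Thorndike's procedure: it invents a deliberately skewed set of hypothetical raw scores, converts them to standard deviation scores, and compares the skewness before and after. The data and the helper names are assumptions made for the example.

```python
import random
from statistics import mean, pstdev


def skewness(xs):
    """Third standardized moment; zero for a symmetrical distribution."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


def to_sd_scores(xs):
    """Express each score as its deviation from the mean in S.D. units."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]


random.seed(0)
# Hypothetical raw scores drawn from a markedly skewed distribution.
raw = [random.expovariate(1.0) for _ in range(5000)]
sd_scores = to_sd_scores(raw)

print(f"skewness of raw scores:  {skewness(raw):.4f}")
print(f"skewness of S.D. scores: {skewness(sd_scores):.4f}")
```

The two skewness values agree, apart from rounding, because the conversion only shifts the origin and changes the unit. A skewed distribution stays skewed, so normality found in standard deviation scores cannot have been manufactured by the conversion itself.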

C. Quite apart from such purely logical or systematic objections
to the assumption of normality, various difficulties have been
found on what may be broadly considered psychological grounds.

(a) Kuhlmann (1939), among others, has pointed out that
human mating is by no means nonselective, i.e., a matter of pure
chance. This, he contends, is a reason for believing that human
mental traits, which are dependent to some extent, and perhaps
considerably, upon heredity, may not be normally distributed.
The fact itself, no doubt, is true. Human mating is certainly
conditioned by socioeconomic status, for example. And it is known
that the degree of resemblance between husbands and wives in
the matter of intelligence is about the same as that found
between fraternally related persons, i.e., brothers and sisters. But
its effect upon heredity is not at all clear. Adult standing height,
for example, is normally distributed, although it seems to be
largely an hereditary trait. So there seems no indubitable reason
why selective mating should disturb the normality of the
distribution of mental traits.

(b) Symonds (1923) undertook a revision of what he calls the
“first approximation of the curve of distribution of intelligence.”
This is the approximately normal curve of distribution of
intelligence quotients as reported by Terman. In making his revision,
Symonds took the data on occupational intelligence levels as
worked out and tabulated by Fryer (1922) from the Army testing,
to which reference has already been made. He then plotted the
distribution of intelligence for the nine major occupational groups
set up in the 1910 census, and combined them into a composite
curve. This curve he showed to be heavily skewed towards the low
end, that is, to indicate a great many more persons on the lower as
contrasted with the higher intelligence levels. The reason is that
there are far more persons in the occupations of relatively low
status with regard to intelligence.

FIG. 31. THREE APPROXIMATELY NORMAL DISTRIBUTIONS OF DERIVED
SCORES: A at sixth grade level; B at ninth grade level; C at college
freshman level. Thorndike and others (1927), Figs. 87, 88, 122.

There seems to be an unexplained contradiction here. The I.Q.’s
of the standardization group of the first Stanford Revision
distribute normally. This group consisted of about 1,000 subjects and
included all native white children in a community who were
within a certain number of months of a birthday. The I.Q.’s of
the standardization group of the second Stanford Revision also
distribute normally. And this group was not only much larger,
but was chosen to parallel the occupational subdivisions of the
census. Yet on Fryer’s data this ought not to happen, because of
the large surplus of persons on the lower occupational-intelligence
levels. Until further analysis has been made, this would seem to
raise a certain doubt.

(c) Innate capacities are more likely to be normally distributed 
than those associated with environmental factors. Apart from any 
question of cause and effect it seems certain that mental traits
are associated with amount and kind of schooling, socioeconomic
status, home conditions, and so forth. This again raises a doubt
as to the normal distribution of mental traits. 
(d) We have noted a tendency to develop what have been called
“special purpose” tests, such as the Medical Aptitude Test or the
Iowa Placement Examination. These, we have argued, are essentially
mental tests slanted, standardized, and used for special functional
groups. The contention has been that they reveal genuine
mental traits in the special setting of the functional group rather
than special aptitudes in the strict sense. But the assumption has
always been that mental traits are normally distributed in an
unselected population. If such special purpose tests become more
common, which is very likely, a problem will be created, for
the traits which they measure may not be normally distributed in
the functional populations for which they are designed. This can
only be proved one way or the other by some such investigation
as that of Thorndike. But if it should turn out to be the case, then
all psychometric techniques based on the assumption of normality
would be disallowed with tests of this kind.


2. Conclusion 


The argument for believing that intelligence is normally dis- 
tributed in an unselected population is circumstantial and indi- 
rect. But it is a strong one. It seems to contain no logical fallacies, 
and it is the only type of argument that can be used, since direct 
demonstration is not possible. The belief that other mental traits 
are normally distributed depends chiefly on analogy with intelli- 
gence, and is much less secure. Above all, there is no a priori 
reason why they should be or must be, or why they should not be. 
There are some psychological difficulties in the way of the assump- 
tion so far as intelligence is concerned, but they only indicate 
possibilities, and the positive argument is supported though not 
decisively established by massive facts. The general upshot is 
that makers of many tests have been far too free and easy about 
assuming a normal distribution without carefully scrutinizing 
their data to see whether the belief could be supported. The reason 
is obvious. Once the distribution is known, the computation of 
sound interpretive norms becomes a straightforward business. But 
this is not an argument, although the assumption may be ap- 
proximately correct and probably quite often is. So the student 
should understand that this chief girder in the foundations of 
psychometrics is not as secure as numerous confident statements 
might lead him to suppose. It certainly cannot be taken as an 
axiom. 


WIDER PSYCHOLOGICAL ISSUES 373 


HEREDITY AND ENVIRONMENT 


1. The position 


There is no direct knowledge regarding the inheritance of 
human mental traits. It is a problem on which controlled direct 
investigation is virtually impossible. Such evidence as there is 
available is indirect, and a large proportion of it has already been 
discussed above in connection with other topics. 

It is established that there is a definite family resemblance in 
mental traits. This resemblance, so far as average trends are con- 
cerned, is proportional to the closeness of the blood relationship. 
Also, it tends to be transmitted from generation to generation. 
The Jonathan Edwards family, for instance, produced a long 
series of distinguished persons. Low-grade mentality and feeble- 
mindedness, too, have been found to appear generation after 
generation in the same family stock. These results are certainly 
suggestive. But they are hardly more, because the general
environmental setting of a given family is probably more or less constant,
and may well account for some part of the resemblance. It is very
probable that innate factors are influential, but just how great
their influence is cannot be determined. This is the more manifest 
when it is recalled that changes in domestic environment, such as 
those produced by adoption, seem to affect mentality at least to 
some extent.

As has been shown, both the level of mentality and changes in
it appear to be associated with socioeconomic status, type
[...]ation of schooling. One should be very [...]
the existence of a causal relationship, [...]
factors in the socioeconomic pattern
and the educational setting that go with high mental level or the
improvement of mentality are not clear. All that is known is that
general favorable conditions of life, and some fairly specific
favorable general influences such as those of preschool attendance, go
with superior mental ability, and sometimes and for some persons,
though not always and for all persons, are connected with an
improvement in mental test performance. Also the converse is
true. In fact, unfavorable general and special conditions seem
[...] associated with a cumulative depression of mental
level than are favorable conditions with its cumulative
improvement.

The fact that mental traits and abilities are within certain limits
constant also suggests that heredity is influential. But once more 
the connection between the inheritance of mental traits and their 
constancy is not certain, because, as Kelley (1927) has pointed 
out, various traits known to be acquired may also be very
persistent and hard to alter. Moreover, the constancy of mental traits
is only relative. It is always tempting to speak, as Cobb (q.v.) 
does, about the limits set to achievement by limited intelligence, 
and no doubt they exist. The gist of Cobb’s principal argument is 
that continuation in school and school performance are rather 
closely associated with intelligence as revealed by tests. But one 
must remember first that the school environment is biased in 
favor of those who do well on intelligence tests and against those 
who do not, and second, that many persons could step up their 
achievement to an undetermined degree if they were taught with 
the greatest possible expertness. Limits no doubt there are, but 
the idea of a hard-and-fast ceiling is a great oversimplification. 

As to the relationship between race and mentality, the data are
affected by so many doubtful factors that no conclusions can be
safely drawn about what it implies for the influence of heredity
and environment.


This seems to be a fair statement as to the present position. It 
is anything but satisfactory, and there is little exact or reliable 
knowledge. Because of this there has been a tendency to dismiss 
the whole question as academic. But, as we shall see, certain 
important inferences can be drawn. 


2. Implication for psychometrics 


A. The general issue has been formulated by Thomas (q.v.). He
argues that three basic psychometric assumptions essentially em- 
body an hereditarian interpretation of mentality. (a) A test 
measures a single unitary factor, or at least purports to do so, or 
undertakes to do so. This may be general intelligence, or some 
primary mental ability, or what not. (b) Inevitably a test makes 
use of environmental material, such as language, numbers, shapes, 
colors, mazes, mechanical devices, and so forth. But the incidental 
acquisition of command of such material is proportional to men- 
tality. (c) For each individual mentality develops uniformly 
from year to year. 

If we ask whether such an interpretation is borne out by our 
psychological data, the answer can only be in the negative. Thus 
Gesell and Amatruda (q.v.) find that the environmental
retardation of mentality is an undoubted fact. It is often found in young
children brought up in a large impersonal institution, even when
it is well run and managed. They list 12 recognizable symptoms
of the environmental retardation of mental growth (p. 291).
“(1) Diminished interest and reactivity, appearing about the 8th
to 12th week [of life]. (2) Reduced integration of total behavior,
about the 8th to 12th week. (3) Beginning retardation evidenced
by disparity between exploitation in supine and sitting positions,
from the 12th to the 16th week. (4) Excessive preoccupation with
strange persons, from the 12th to the 16th week. (5) General
over-all retardation of function, appearing from the 24th to the 28th
week. (6) Blandness of facial expression, from the 24th to the
28th week. (7) Impoverished initiative, from the 24th to the 28th
week. (8) Channelization and stereotypes of sensori-motor
behavior, from the 24th to the 28th week. (9) Ineptness in new social
situations, appearing from the 44th to the 48th week. (10)
Exaggerated resistance to new situations, appearing from the 48th to
the 52nd week. (11) Relative retardation in language behavior,
appearing from the 48th to the 52nd week. (12) Definite
improvement with improved environment.” Similar effects can
undoubtedly be identified in older subjects also, such as canal-boat
children, orphanage children, and the like. A mental test which did
not register changes of this kind, and which treated them merely as
sources of variable error, would not be penetrating down to some
unchanging mental substratum. It would simply be a bad test.
And of course a good test should reflect favorable effects as well.

Let us consider how this bears upon the three assumptions
whose hereditarian basis is alleged by Thomas.

(a) A test must always be built about a certain concept. This 
concept is translated into test items, and item performance is
evaluated in terms of the performance of a standardization group.
But it need not be understood as corresponding to a unitary
mental factor or entity. It is simply a category for the
classification and better understanding of behavior, valid in so
far as it works, like the psychiatric categories, for instance.
General intelligence as a concept has on the whole worked well, and
many of the endeavors to improve test construction turn on the
[...] Psychiatrists presumably [...]ences.
The same attitude should be taken regarding psychometric
concepts. Without such concepts the exploration of mental life 
would be impossible, and if they are crude or erroneous, the 
purpose should be to improve them. 

(b) As to the environmental content of tests, the point is not 
to try to eliminate it, which would be impossible. Nor is it neces- 
sary to assume that the incidentally acquired mastery of such 
material is directly proportional to mental level. Mentality is not 
something separate from environmental material. It expresses 
itself and indeed exists in it. As Stoddard (1943) puts the matter, 
the true purpose is “to provide materials so that differences in 
experience which do not contribute to differences in intelligence 
will be minimized” (p. 115). In so saying he is pointing towards 
the special purpose intelligence test, in which the concept is in 
effect defined in terms of some functional group, and the test
material selected to play up this limited but intelligible concept
as clearly as possible.


(c) As to the assumption that, for each individual, mentality 
grows uniformly from year to year, this is not necessary at all. 
If we have a test properly built about a good working concept 
which furnishes a guide-line for understanding behavior, changes
in growth-rate can show up in the scores just like any other
changes, which is just as it should be.

Thus psychometric practice does not depend upon or imply an
hereditarian position.

B. A second very important implication of our general position
regarding heredity and environment is that the more precise
analysis of the environment is a major and demanding task. The
term is usually used in a broad and sweeping sense, but it covers
a multitude of pertinent differences. Which factors of the
environment are prepotent and active, and which are inert in their effect
upon mentality? The question is only now beginning to come up
for analysis, but it is obviously pertinent. Wellman (1940b), for
instance, points out that most socioeconomic scales concentrate
upon the physical and material aspects of the home, and that
these may not be the psychologically decisive ones. A start at least
towards improvement is to be found in the Minnesota Home
Index (v. Leahy, 1936), which rates in terms of six major
divisions; namely, Children’s Facilities, Economic Status, Cultural
Status, Sociality, Occupational Status, and Educational Status.
As has been pointed out in these pages, parental income within
fairly wide limits is probably inert in its relationship to mentality.
Or again, there is reason, as we have seen, to believe that certain 
kinds of school experience are associated with improved mental
status. But we only know in a general way what kind, and the
identification of the active and prepotent factors is only a matter 
of reasonable guess. So once more, the concept of chronological 
age covers a multitude of differences which probably affect men- 
tality on the principle of fifty years of Europe being worth a cycle 
of Cathay. 

Furthermore, environment cannot be evaluated in isolation, 
but only with reference to the responding individual. The same 
environment almost certainly does not have the same effect on 
persons of different mental levels and dispositions, and of
different ages. A bright and sensitive child will quite possibly be
gravely depressed by an environment which an average or
below-average child tolerates quite well.

These are some of the questions that press upon us in view of
what seems to be a reasonable position regarding the influence
of heredity and environment. An outright hereditarian assumption
is quite impossible. Neither is it possible to determine the
proportional influence of innate and environmental factors, although
this has from time to time been attempted. Thus Leahy (1935)
in her study of the psychological effect of placement in foster
homes finds by means of a statistical analysis of her data that
specific home environment accounts for 4% of changes in the
mentality of the adopted children who were her subjects. Newman,
Freeman, and Holzinger (q.v.), again, in reporting their work
on 19 pairs of identical twins raised apart, find that social and
economic influences account for 72% of the divergences which
appeared when all 19 pairs were considered, but that when the
4 most extreme cases were eliminated, the figure falls to 20%.
Numerous other investigators, including Hirsch (1930) and Burks
(1928), have taken the responsibility for similar statements. As
demonstrations of statistical analysis they are interesting, but it
is doubtful whether they have much assignable general meaning.

In any case the point is not to adjudicate a partisan competition
between heredity and environment, but to discover what
influences are prepotent, how much change they can produce, and why.
And this does not imply the least disparagement of psychometrics,
but rather the contrary, for without its techniques and
instrumentalities such investigations would be impossible.


378 PSYCHOLOGICAL TESTING 


THE PSYCHOLOGICAL SIGNIFICANCE OF TEST SCORES


The best-known interpretive classification of test scores is that 
of Terman, which is shown in Table 55. As will be seen, it attaches 
more or less definitely meaningful characterizations to various 
I.Q. levels. Many objections to it have been made by clinicians, 
guidance officers, workers in applied psychology, and others who
have to interpret the scores in terms of practical action.
Particularly his use of the words genius and feeble-minded has been
criticized. A genius, it is pointed out, is a great deal more than a
person with an I.Q. of more than 140. And some individuals who
are usually considered geniuses have had intelligence quotients
below this level. Similarly, it is said that feeble-mindedness
cannot be defined simply in terms of mental level as established by
tests, although, as will be noted, Terman himself qualifies this
classification. Terman (1921), in reply, has said that intelligence
quotient classifications are not to be considered clinical groupings,
but this answer has not been found quite sufficient. In connection
with the second Stanford Revision, Merrill (1938) has re-edited
this scheme of classification. The I.Q. level 70 to 79 now comes
to mean “border-line defectives,” while those below 70 indicate
true defectiveness. Thus the term feeble-mindedness disappears,
but the scheme in essence still stands.


TABLE 55
INTELLIGENCE CLASSIFIED BY I.Q. LEVELS
(Terman, 1916, p. 79)

I.Q.         Classification
Above 140    “Near” genius or genius
120-140      Very superior intelligence
110-120      Superior intelligence
90-110       Normal or average intelligence
80-90        Dullness, rarely classifiable as feeble-mindedness
70-80        Borderline deficiency, sometimes classifiable as
             dullness, often as feeble-mindedness
Below 70     Definite feeble-mindedness

Wechsler (1944) has set up a somewhat different scheme of
classification, with less positive designations. This is shown in
Table 56. It will be noted that Wechsler drops both the term
genius and the term feeble-minded, with their far-reaching and
uncertain connotations. Also, he has proposed still another scheme,
which is purely statistical, based on nothing but the portions of
the total distribution of I.Q.’s and the percentage of the total
number of cases included in each category. He still attaches
descriptive terms to these classes, but the essence of the scheme is
categorization in statistical terms. This arrangement is shown
in Table 57.


TABLE 56
INTELLIGENCE CLASSIFIED ACCORDING TO I.Q.
(Wechsler, 1944, Table 4, p. 40)

Classification     I.Q. Limits     Percent Included
Defective          65 and below     2.2
Border Line        66 to 79         6.7
Dull Normal        80 to 90        16.1
Normal             91 to 110       50.0
Bright Normal      111 to 119      16.1
Superior           120 to 127       6.7
Very Superior      128 and over     2.2


TABLE 57
STATISTICAL BASIS OF I.Q. CLASSIFICATION
(Wechsler, 1944, Table 3, p. 40)

Classification     Limits in Terms of P.E.     Percent Included
Defective          -3 P.E. or less              2.15
Border Line        -2 to -3 P.E.                6.72
Dull Normal        -1 to -2 P.E.               16.13
Normal             -1 to +1 P.E.               50.00
Bright Normal      +1 to +2 P.E.               16.13
Superior           +2 to +3 P.E.                6.72
Very Superior      +3 P.E. and above            2.15
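
The percentages in Table 57 follow from the normal curve once the class limits are stated in probable errors (P.E.). The Python sketch below is an illustration only and is not taken from Wechsler. It assumes an exactly normal distribution and uses the standard relation that one P.E. is about 0.6745 S.D. The function names are invented for the example.

```python
from math import erf, sqrt

# One probable error expressed in S.D. units for a normal distribution.
PE_IN_SD = 0.6745


def normal_cdf(z: float) -> float:
    """Proportion of a normal distribution lying below z S.D. units."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def percent_between(lo_pe: float, hi_pe: float) -> float:
    """Percent of cases between two limits given in P.E. units."""
    return 100.0 * (normal_cdf(hi_pe * PE_IN_SD) - normal_cdf(lo_pe * PE_IN_SD))


def percent_beyond(pe: float) -> float:
    """Percent of cases lying more than `pe` P.E. above the mean."""
    return 100.0 * (1.0 - normal_cdf(pe * PE_IN_SD))


# The curve is symmetrical, so each band above the mean has a
# counterpart of equal size below it.
print(f"-1 to +1 P.E.:     {percent_between(-1, 1):.2f}%")
print(f"+1 to +2 P.E.:     {percent_between(1, 2):.2f}%")
print(f"+2 to +3 P.E.:     {percent_between(2, 3):.2f}%")
print(f"+3 P.E. and above: {percent_beyond(3):.2f}%")
```

The computed bands agree with the table to the second decimal place. The scheme is thus a slicing of an assumed normal curve and involves no independent judgment about what each level means socially.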
1. Test scores and social meanings


All this well defines the problem of attaching psychological 
significance to test scores. It is, from one point of view at least, 
to determine and define the actual capacity for effective social 
behavior indicated by a given score, or a score within a given 
range. 

A. Feeble-mindedness is a psycho-social rather than a psy- 
chometric category. The British Mental Deficiency Act defines 
an idiot as one who is unable to protect himself against common 
physical dangers, an imbecile as one who is unable to communi- 
cate by written language either through reading or writing it, and 
a moron somewhat less clearly as one who needs care and super- 
vision for the protection of himself and others. These are degrees 
of feeble-mindedness, roughly described in terms of social be- 
havior. A given mental test score may indicate one of these classi- 
fications, but it is only one criterion and needs to be supplemented. 
Some legal definitions of feeble-mindedness, framed with a view 
to institutional commitment, specify an I.Q. of less than 75, others 
an I.Q. of less than 70. Wechsler himself has pointed out that two 
children may have the same I.Q. and yet require very different 
treatment, and that of two persons with an I.Q. of about 75, one
may be definitely defective while the other is not. Also, Doll (1923) 
insists that feeble-mindedness is a social category in which some 
persons in the I.Q. range from 65 to 75 do not belong, while others 
between 75 and 80 do. Moreover, a given intelligence quotient 
may mean dependency for a white but not for a Negro, or for a 
boy but not for a girl. The reason for this second differentiation 
is again psycho-social rather than psychometric, because women 
of low mentality are likely to receive sufficient care and support 
so that they need not be institutionalized, whereas men of the 
same mental level are not. It is still an open question whether on 
mental test scores boys are more variable than girls (McNemar 
and Terman; Kuznets and McNemar). Thus social rather than 
psychometric considerations determine the classification. 

B. As to the gifted individual, a fairly clear and complete pic- 
ture has been developed by Terman (1925) in his very extensive 
studies in genius. The general outcome of this work has been 
compactly summarized by Pintner (1931), whose account is as
follows: “About 1000 children above I.Q. 130 were selected for
study, and these compared with children whose I.Q.’s are normal.
It was found that among the gifted the ratio of boys to girls was
higher than that in the general population. In racial origin these
California children were found to be mainly of Western European
and Jewish stock. The Jewish stock contributed about twice that
expected from the total Jewish population of the areas
investigated. The average social status of the families was much higher
than that of the average family. In general the family incomes are
fair, and they live in superior neighbourhoods, but there are
isolated cases from very poor families living in inferior
neighbourhoods. These children come from families where there are
distinguished relatives in much greater proportion than would be
found in the average family. The vital statistics of the families
show a healthier than average stock, with few cases of insanity or
feeblemindedness. The anthropometric measurements show the
gifted group physically superior. The medical examinations show
them also superior to average children. In school progress they
are 14 per cent of their age above the norm in grade location,
and 48 per cent of their age above the norm in intelligence, so
that they are under-promoted to the extent of 34 per cent. Their
school marks are better than those of ordinary children. On
standard educational tests the E.Q.’s of the gifted are high, but
not as high as their I.Q.’s. The gifted are no more uneven in their
school abilities than ordinary children. Their occupational
ambitions are higher than those of the control group. In general they
have the same type of interests as ordinary children. They make
more collections, particularly of a scientific nature. Their play
interests are in general like those of the control group, with a
somewhat greater interest in plays that require thinking. They are
more mature in their play interests, showing a greater liking for
quieter and [...]. These gifted children read a great
deal more than does the average child. The average gifted child
of 7 reads more books in two months than the average control
child up to age 15, and the range of reading is much wider. In
character and personality they are very superior, about 85 per
cent of the gifted being above the median of the control group”
(pp. 361-62).

In the same way, but in an investigation of less scope, Lorge
and Hollingworth (q.v.) followed up a group of very superior
children who had reached the ceiling of the I.E.R. Intelligence Scale
CAVD during secondary school, and of whom some had I.Q.’s in
excess of 170. At college age they were found to have carried on
research in history, mathematics, and chess, and two of them were
established in the learned professions.

Thus there can be no doubt that a high intelligence score is an 
indicator of general high level social behavior. But as Stoddard 
(1943) as well as many others have pointed out, the statement 
that it indicates genius is most misleading. The performance of 
the superior group studied by Lorge and Hollingworth is no doubt 
highly creditable and unusual, but it has none of the distinctive 
marks of supreme creative achievement. And an I.Q. of 140, far 
from indicating genius, is actually reached by a not inconsiderable 
proportion of American college students. 

C. It has been proposed by Darsie (q.v.) to interpret intelligence
scores in terms of educational prospects and promise. Thus
an I.Q. of 115 and over would indicate a good college prospect,
the range from 100 to 114 would indicate education through
secondary school only, from 80 to 99 would indicate placement in
a vocational high school, 70 to 79 placement in an industrial high 
school, and below 70 a special type school. Educational guidance 
workers might find intelligence ratings so interpreted helpful, but 
probably would not apply them rigidly in individual cases. 

D. One point that must be borne in mind is that comparatively 
small differences in test performance may involve enormous dif- 
ferences in social potential. An outstanding result of psychometrics 
has been the demonstration of a very great variety of individual 
differences. But Wechsler (1935) has shown that these differences 
are by no means so extreme as is sometimes supposed. He brought 
together virtually all the data that had been published up to 1933 
and analyzed them to show the linear extent of individual differences.
In order to do this he studied the distribution of endowment
on 80 traits, physical, motor, and mental, cutting off the
top and bottom .1% of his groups to avoid rare extreme variants 
which would disturb his averages and make them misleading. The 
actual linear range of ability on these traits was smaller than 
might be expected. For height the total variation lay between the 
proportions of 1.22:1 to 1.40:1, with the mean for all studies 
1.30:1. All motor traits together had a range in the proportion 
from top to bottom of from 1.65:1 to 2.50:1, with a mean total 
range of 1.30:1. Perceptual traits together had a range from top 

to bottom from 2.30:1 to 2.85:1, with a mean of 2.58:1. General
intelligence as measured by M.A. for children of C.A. from 9 to
9.99 years had a mean range of 2.30:1. Hard learning in substitution
tests measured in seconds for 766 boys 14 years old had a
mean range of 3.87:1. This would seem to dispose of the claims
sometimes found that human beings are likely to show such extreme
differences in respect of this or that trait that some will be
ten, twenty, or more times as able as others. As Wechsler points
out, the human race is much more homogeneous than some other
organic types, such as dogs or trees. And one consequence is that
comparatively small differences in test performance may point to
enormous differences in prestige and social and general success.
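
The kind of computation described above can be sketched in a few lines of Python. This is an illustration only: the function name, the trimming parameter, and the sample heights are assumptions invented for the example, and they are neither Wechsler's data nor necessarily his exact procedure. The sketch drops the extreme one case per thousand from each end of a distribution and then takes the ratio of the largest remaining value to the smallest.

```python
def trimmed_range_ratio(values, trim_per_thousand=1):
    """Ratio of the largest to the smallest value after dropping the
    extreme `trim_per_thousand` cases per thousand from each tail."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Number of cases to discard at each end (integer arithmetic).
    k = len(ordered) * trim_per_thousand // 1000
    kept = ordered[k:len(ordered) - k] if k else ordered
    if kept[0] <= 0:
        raise ValueError("a ratio is meaningful only for positive measurements")
    return kept[-1] / kept[0]


# Hypothetical heights in inches: 998 ordinary cases plus one rare
# extreme at each end. These figures are invented for illustration.
heights = [40.0] + [60 + 0.015 * i for i in range(998)] + [95.0]

untrimmed = trimmed_range_ratio(heights, trim_per_thousand=0)
trimmed = trimmed_range_ratio(heights)
print(f"untrimmed ratio: {untrimmed:.2f}:1")
print(f"trimmed ratio:   {trimmed:.2f}:1")
```

With the two rare extremes removed, the ratio falls from well over 2:1 to roughly 1.25:1, which is the sense in which the trimmed "linear range" is smaller than might be expected.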


2. Quantity and quality 
Another way of looking at the problem of the psychological
significance of test scores is to think of it as the assignment of
qualitative meanings to quantitative indices.
Thus a mental age score of 10 attained by a person whose
chronological age is fifteen indicates a different kind of mind from
that of a person whose M.A. and C.A. are both 10. The former
will probably excel the latter at muscular, motor, and routine
performances including memory, whereas the latter will excel the
former in verbal discrimination and linguistic and numerical performances
involving high-level organization (Merrill, 1924; E. B.
Greene). Terman (1906) found just such differences between
seven “bright” and seven “stupid” boys.

Also, mental test performance reflects personal and temperamental
type. We have seen that there are characteristic differences
in Stanford-Binet and Wechsler-Bellevue profiles for different
psychotic categories. Wells and Kelley (q.v.) found the performance
of psychotics on vocabulary and digit memory stood up well,
but that there was marked deterioration in drawing designs from
memory, paragraph interpretation, and the Ball and Field test
(Stanford-Binet year 12). In all probability less marked temperamental
deviations also affect test performance qualitatively.

Top level differences are also significant. Hollingworth and
Cobb (q.v.) studied 20 pairs of children matched on home conditions,
whose mean I.Q.’s were 165 and 146. The brighter were
markedly better in more complex tasks, such as word interpretation,
language use, and mathematical thinking. In routine tasks
there was little difference. The two groups entered school at mean
M.A.’s of 11-11 and 13-4. By the time the duller group had
reached a mean M.A. of 13-4, their school performance was not
equal to that of the brighter.

All aspects of personality are tied in with a mental test score
and what it measures or indicates. Thus it has long been known
that there is an association between low mentality and delinquency
(v. Pintner, 1931, for evidence up to that time, and Lane
and Witty for a summary of later findings). So Ackerson (q.v.)
finds a much higher proportion of behavior problems among children
of low I.Q. among a population of 5,000 of which he made
a study. Yet we certainly cannot say that low intelligence is a
direct cause of delinquency. The truth is that less intelligent 
individuals tend to be easily led, passive, timid, easy victims for 
vicious leadership and suggestion, often fixed in an environment 
where opportunity and stimulation are lacking, and so unable to 
use what capacities they have (Doll, 1934, 1940). How much 
social setting and tradition have to do with delinquency is shown 
by the fact that it is found much more frequently in boys than 
in girls, the proportion running from 5 to 3 to 7 to 3. Nevertheless 
the typical delinquent is at most of dull normal mentality, and 
an I.Q. of 80 is an important diagnostic sign, indicating the sort
of problems and difficulties the person is likely to encounter in 
his living. 

So, conversely, is a very high intelligence score. Reference has 
already been made to the study by Lamson (1930) in which she 
investigated 56 very high rating children, comparing them with 
a paired control group and following their careers through high 
school. The high rating group greatly excelled the controls in 
scholarship on all indices. They also had fewer school failures, 
but these in the main were due to independence, conceit, a refusal 
to conform, and so on. Disciplinary troubles, too, came from the 
same causes. This group was also high in extra-curricular par- 
ticipation, with activity honors in excess of expectation. The very 
brilliant person, if he has difficulties, manifests them by being 
idiosyncratic, peculiar, haughty, impatient of control, critical, and 
conceited. But it would be quite as absurd to say that high intelli- 
gence is a direct cause of these traits as that low intelligence 
causes delinquency. High intelligence is no guarantee of good 
adjustment, but like low intelligence, it indicates the type of 
maladjustment to be anticipated if it occurs (Hollingworth, 1940). 

To refer once more to genius, Cox (q.v.), in an elaborate study
of 301 persons of the highest eminence who lived between 1450
and 1850, estimated a mean I.Q. of not less than 155 and probably
about 165. Some of the actual estimates of intelligence
quotients were: Byron, 150; Scott, 150; Darwin, 135; Goethe,
Mill, Macaulay, Pascal, and Leibnitz, all above 180. But Cox also
points out that for the greatest achievement favorable character
traits are likewise of the highest importance. Galton’s famous
threefold formula for achievement still holds. It consists of great
ability, power to work, and willingness to work. Hollingworth and
Terman (q.v.) have remarked that not one person in four or five
of those of I.Q. 180 and over has all three characteristics.


3. Conclusion


The clear general conclusion is that psychometric scores, because
they are quantitative and obtained under uniform conditions,
are extremely valuable indices. But they always need to be
interpreted in the light of all the facts, social and personal. They
can greatly aid us to understand and guide human beings as total
personalities in total life settings. This in itself is evidence that
the working concepts around which tests are built are reasonably
well founded and provide authentic guide lines for the understanding
of human mentality, and that the tests themselves are
reasonably valid. The contention of Thomas and others that psychometrics
is essentially atomistic or mechanistic, that of necessity
it deals only with parts and never with the organic whole,
is nonsense. Of course, psychometrics is analytic, but no phenomenon
can be explained, or understood, or managed, without
analysis. The only questions are whether the analysis is tolerably
sound, and whether its results are put to wise and constructive
uses.


QUESTIONS FOR DISCUSSION


1. Does the assumption of a basic constancy of mental capacity 
seem reasonable? Why? Can you find reasons against it? 

2. If a person changes when he learns, what, if anything, in his 
mental make-up can be considered constant? 

3. How do the data here presented on mental growth seem to you 
to bear on the problem of constancy? 

4. Why has it seemed so very important to determine whether or 
not the I.Q. is “constant”? 

5. Carefully analyze and formulate the psychological reasons why 
the very idea of a single linear growth curve may be untenable. 

6. If you were constructing an intelligence test for adults, what
kind of items would you tend to choose? What kind would you tend
to avoid?

7. Examine a number of standard intelligence tests, and critically
appraise their items from the standpoint of measuring adult intelligence.

8. Would any of the criticisms of intelligence tests from the standpoint
of their use with adults apply also to other types of tests, e.g.,
some of the personality tests previously discussed?

9. Check over in detail and with care the statistical procedures
used by Odom and by Thurstone in trying to establish equal units
and a true zero point. What logical or statistical assumptions do they
seem to make? How does this affect the interpretation of their results?

10. If broad statements as to the proportional effect of heredity
and environment cannot be validly made, is the whole issue without
practical importance? If this is not true, what importance has it?

11. Suggest some of the matters concerning vocational, educational,
social, and business relationships and activities, and in fact concerning
the whole range of a person’s living that might be indicated by
either a high or a low intelligence test score. How certainly would
they be indicated by such a score? If not with complete certainty,
would the score still suggest possible interpretations, cautions, etc.?
What other factors should be considered also?


CHAPTER XI 


THE EVOLUTION AND IMPROVEMENT OF 
MENTAL TESTING 


PLAN OF THE CHAPTER 


The time has now come to weave together the various lines of 
discussion presented in this book into a unified perspective, and 
to see how the testing movement as a whole has evolved and is 
today evolving towards increased effectiveness. Since the first 
introduction of Binet’s practices into the United States and the 
early work in group testing during World War I, there have been 
many developments. Some of them have turned chiefly on in- 
creased efficiency in administration, scoring, and so forth, and are 
well typified by the contrast between the Terman Group Test of 
Mental Ability and the Otis Quick-Scoring Tests of General Men- 
tal Ability. These are certainly improvements, but they do not 
go very deep. Other developments have turned upon the extension 
of testing into new fields, such as attitudes, values, opinions, per- 
sonality types, morals and conduct, and aptitudes of many kinds. 
Here there have been some successes and many failures, and 
valuable insights have been gained into what are and what are not 
promising areas for psychometric application. Of deeper signifi- 
cance has been the development of types of tests which are new, 
not in the sense of bringing measurement to bear upon regions 
previously unexplored, but rather in the methods used in their 
construction, and in the kind and significance of the scores and 
norms they yield. It is here that a certain continuity of evolution
may be discerned—an evolution still going on, and pointing to 
better achievements for the future. 

The core idea of this evolution can be briefly and simply epito- 
mized. It is a search for testing instruments and practices richer 
in psychological significance. The central question is and always
must be that of validity. What and how much does a test performance
mean? As we have seen, existing instruments have a
palpable and indubitable meaningfulness which only the prejudiced
can deny. But the main trend of evolution is towards the
making of tests that will yield surer and more significant indices
of the quality of the individual’s mental life, and the probable
effectiveness of his social behavior. That this is a subject of the
highest theoretical interest to the specialist in mental testing
clearly needs no demonstration. But it has a practical importance
also, for only by an intelligent appreciation of what has been
accomplished and what is being attempted can any person use and
interpret existing tests judiciously, or avail himself of those that
are emerging.

This quest for greater psychological significance or more authentic
validity will be considered under four aspects: the search for
the most stable and meaningful unit of measurement, the search
for the most significant interpretive standardization, the search
for new and better operating concepts, and the tapping of new
psychological resources, this last having to do with the emergence
of projective rather than psychometric tests. All these four aspects
are interrelated, and their separate treatment, like all abstraction
and analysis, is only for the necessary purpose of clarification.


THE STABILITY AND MEANINGFULNESS OF UNITS OF
MEASUREMENT


For any unit to be stable, it must fulfill two conditions. First,
there must be a fixed point of reference, an origin, whose meaning
is unambiguously defined. When temperature is being measured,
the point of origin is established by the freezing point of water
under stated atmospheric conditions. On the centigrade scale this
is called zero, and on the Fahrenheit scale thirty-two degrees. Here
as always, the name or symbol used does not matter so long as
its reference is unmistakable. When weight is being measured, the
point of origin is established when an equalized balance scale is
horizontal. When distance is being measured, the feet, or centimeters,
or other units are laid off from a zero point from which
the measurement starts. No scale of units which does not yield
such a fixed origin can be used to obtain stable results.

The second condition is that each unit must be equal to every
other unit, or vary from it according to a known law. The latitude
units on a globe illustrate the former alternative, and those
on a Mercator projection map the latter. Inequality of units does
not matter so long as the law of their variation is known. One
might roughly measure the speed of a car by recording the amount
of gasoline consumed per equal unit of time. The relationship
would be a changing one because it takes much more power to
accelerate from sixty to seventy miles an hour than from ten to
twenty. But although it changes, the law of its change is known
and can be allowed for in our computations.

These same two conditions must be fulfilled by units and scales
of mental measurement if they are to possess the characteristic of
stability.

As to meaningfulness, a unit of measurement is meaningful in 
so far as it enables us to predict, or understand, or deal with a range


of actual phenomena. Degrees of temperature, inches, or zoma 
are meaningful in this sense. They fulfill the logical and statistical
requirements for stability, or they would be useless. But they are
much more than abstract statistical entities, because they enable
us to deal with phenomena and experience by giving answers to
the question: How much?

Units of psycholo 

in this sense. It mu: 


preted by the use of 
make its broad meaning 
derived score or unit that shall
be as stable as possible, and also have as much psychological
significance as possible.

It is in the light of these ideas that the most important units
of mental measurement will be discussed, beginning with global
measures and turning then to profile ratings.


1. Global scores 


A. Mental Age. This is a derived score based upon chronological
age. This at once gives it a definite reference point or origin,
namely, chronological age at birth. In psychometric practice it is
almost always defined as the mean test performance or raw score
of an age group, although logically it could also be defined as the
mean age of a group achieving a given test performance or raw
score. Thus the M.A. fulfills the first condition for stability, that
of having an unambiguous origin of reference. The second condition,
however, that of equality, it does not fulfill very satisfactorily,
and this is one of its chief limitations. The real differences
in mental ability indicated by equal M.A. differences almost certainly
vary and become less as age advances. Thus there is more
change in mentality between M.A.’s 4 and 5 than between M.A.’s 11
and 12. But since the form of the mental growth curve is not known
and perhaps never can be known, the law of these changes cannot
be determined, as it can very easily on a Mercator map of the
world, and only very rough allowance can be made for them.
Also, the M.A. is not applicable above the “age of arrest,” which
differs for different tests, and means the chronological age at which
test scores cease to increase regularly as C.A. increases. Thus the
M.A. is by no means a perfectly stable unit of measurement, but
it has enough stability to be used to good purpose with proper
precautions.

As to meaningfulness, the following points are important. (a)
The M.A. is a measure of the level of mental development or
maturity, and not of absolute brightness. Thus the proper time
for first grade entry is often set at M.A. 6, without direct reference
to the I.Q., which has a different significance. (b) It depends for
its significance on the type of test used. M.A.’s on verbal tests
and performance tests correlate positively but only moderately.
Thus, in using the M.A. for the evaluation or guidance of any
person, the test on which it is computed should always be considered.
(c) A given M.A. has different qualitative and behavioristic
meanings at different chronological age levels. This we have
already seen. If two children C.A. 6 and 12 both have an M.A. 6,
it indicates quite different patterns of social behavior and types
of mind. To put it more obviously still, no one would think a
person C.A. 30 and M.A. 6 was fit for first grade entry. These, of
course, are limitations, but again they are not fatal. In using and
interpreting the M.A. the following cautions should always be
observed. (a) It does not apply at all to ages above about 16, and
not very well to very young children. (b) The real value of M.A.
differences becomes less as age advances, just how much we do not
know. (c) The test from which the M.A. is derived should always
be considered. (d) The C.A. of the person concerned should
always be considered, and indeed his whole personal and social
setting.

B. The Intelligence Quotient. The intelligence quotient, in the 
only legitimate sense of the term, is the ratio of mental age to 
chronological age. The problem of its stability has given rise to 
numerous misunderstandings and should be stated and understood 
as clearly as possible. 
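
The definition just given can be written out as a short Python sketch. The function name and the choice of months as the unit of age are assumptions made for the example; the multiplication by 100 follows the convention, described in this section, of designating the coincidence of M.A. and C.A. as 100.

```python
def ratio_iq(mental_age_months, chronological_age_months):
    """Ratio I.Q.: mental age divided by chronological age, times 100."""
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return 100.0 * mental_age_months / chronological_age_months


# M.A. and C.A. coincide: the reference point of the scale.
print(ratio_iq(120, 120))            # 100.0

# M.A. 10 at C.A. 15, the case mentioned earlier in the chapter.
print(round(ratio_iq(120, 180), 1))  # 66.7

# M.A. 12 at C.A. 8.
print(ratio_iq(144, 96))             # 150.0
```

The sketch says nothing about whether a given quotient keeps the same meaning from one age to another, which is the question of stability taken up next.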

With the intelligence quotient, as with any other unit of measurement,
a fixed and unambiguous reference point or origin is
essential. This is determined as the coincidence of M.A. and C.A.
The designation 100 is applied to it, but this is merely a matter
of convenience. Any other symbol could be used, just as the reference
point determined by the freezing of water is sometimes called
32° and sometimes 0°. Age scales such as the Stanford-Binet are
organized, and their subtests are arranged in such a way that the
mean I.Q. of each age group will be about 100. This does not in
the least prejudge the question of whether any person’s I.Q. will
remain constant, any more than the setting up of the scale of a
thermometer so that the symbol 0 will be reached when water
freezes means that the temperature of any object to which the
instrument is applied will always stay the same. It is simply the
fulfilment of the first necessary condition for the stability of
any unit. I.Q. 100 at any level always means the coincidence of
C.A. and M.A.


Passing on to the second condition, that of equality, this means
that any other I.Q. value, say 70 or 120, must also invariably
mean the same, just as temperature readings of 50° or -20°
always mean the same. Putting it otherwise, 30 I.Q. points below
or 20 above 100 must always mean the same at all ages. If this
is to be so, two requirements must be met. (a) The distribution
of I.Q.’s at all ages must be approximately the same in range.
This requirement is not perfectly fulfilled by any test, even by
the Revised Stanford-Binet scale, which comes nearest to it. As
is shown in Table 10 and Figure 6, the distributions of Stanford-Binet
I.Q.’s fluctuate considerably at different ages, their standard
deviations ranging from about 12 to about 20. As has been pointed
out, this implies that any given I.Q. other than 100 will have
changing real values, or conversely that a person whose mentality
remains the same will have a changing I.Q. The inaccuracy is not
great for I.Q.’s fairly close to 100 but becomes more and more
serious as deviations increase. As to many other tests, distributions
often vary so much that the I.Q. or other such measure loses all
significant stability. This is so with the Merrill-Palmer scale, as
we have seen. It is strikingly true of the Pintner Non-Language
Test, as is shown by the data reported by Rand shown in Table 58.
Test performance one standard deviation above the mean (100)
would give a C.I. of 167 at C.A. 7, and 115 at C.A. 14. Thus there
are certain tests where the use of the intelligence quotient is
inadmissible. (b) If all I.Q.’s other than 100 are to have a stable
meaning, it is necessary not only that the range or amount of the
age level distributions be approximately equal, but also that their
form be the same. If at one age level there were a distribution
shaped like that shown in Figure 29a (i.e., normal), and at another
age level, like Figure 29b (skewed), then even if the ranges and
the S.D.’s were the same, they would not be equal units. This
requirement of normal distribution at all ages seems to be
approximately fulfilled by Stanford-Binet I.Q.’s up to C.A. 16,
beyond which it becomes a pure assumption. As to other tests, they
vary considerably in this respect. So the regular appearance of
normal distributions should never be taken for granted, and the
direct evidence presented by test authors should always be noted
and checked. (For a fuller discussion of this subject, though in
terms of somewhat outmoded data, see Rand).
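
The arithmetic behind the Pintner example can be made explicit with a small Python sketch. The two standard deviations used, 67 at C.A. 7 and 15 at C.A. 14, are simply those implied by the figures 167 and 115 quoted in the text; the function names are assumptions introduced for the illustration.

```python
def quotient_at(z, sd_at_age, mean=100.0):
    """Quotient earned by a performance z standard deviations from the
    age mean, when quotients at that age have the given S.D."""
    return mean + z * sd_at_age


def z_for_quotient(quotient, sd_at_age, mean=100.0):
    """Distance from the age mean, in S.D. units, that a given quotient
    represents at an age where quotients have the given S.D."""
    return (quotient - mean) / sd_at_age


# S.D.'s implied by the text: 167 at C.A. 7 and 115 at C.A. 14.
sd_by_age = {7: 67.0, 14: 15.0}

for age, sd in sd_by_age.items():
    same_standing = quotient_at(1.0, sd)
    same_number = z_for_quotient(130.0, sd)
    print(f"C.A. {age}: +1 S.D. gives {same_standing:.0f}; "
          f"a quotient of 130 lies {same_number:.2f} S.D. above the mean")
```

The same relative standing yields very different quotients at the two ages, and the same quotient stands for very different relative standings, which is why the quotient loses its meaning when distributions differ so widely.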
TABLE 58


STANDARD DEVIATIONS OF COEFFICIENTS OF INTELLIGENCE ON
PINTNER NON-LANGUAGE TEST FOR VARIOUS AGES


(From Rand, Table 1, p. 605)


Age                    7    8    9    10   ...
Standard Deviations    67   73   59   37   ...


As to the psychological significance of the intelligence quotient,
it turns upon a very simple, straightforward, common-sense idea,
almost too familiar to most parents. This is that the age when a
mental function or performance appears is highly indicative. To
learn algebra at 15 is not remarkable, but to learn it at 6 indicates
endowment of the highest and rarest order. This piece of common
sense is embodied in statistical form in the I.Q. Whereas the M.A.
measures mentality in terms of its developmental level, the I.Q.
measures it in terms of brightness.

It is sometimes said that the I.Q. indicates the speed of mental
growth. But this is much more doubtful, since the very idea of
a growth curve and so of an oy h 
question. There has been considerable discussion as to the growt 
patterns implied in the use of the 


pointed out two alternatives. Suppose that three children, A, B, 
and C, respectively, 


growth may start at 
ferent speeds, so tha 


ent origins and run parallel, 
ains constant. The conditions 
neasure may be met in either 
th curve is a straight line, oF 
c curve. This can be demons 
e interested reader is referre 
to Freeman’s discussion, But its psychological significance, if any, 
is not clear. On the whole, Terman’s position in saying that the 

+ requires no assumptions as to the form of the 
ms the one to accept. 


of the 1.Q. is to give a stated degree of bright- 


with point scales. The C.I. is t
individual’s point score is 120,
his C.I. is 120. This measure has virtually the same logical characteristics
and general meaning as the I.Q. The point of reference is
the mean point score for the age group, which is always 100. To
be a stable unit, it must show even distributions for different ages,
and they must be normal. Much less study has been made of the
C.I. than of the I.Q., and it cannot be compared certainly with
the I.Q. on the basis of demonstrated constancy because of lack of
data. Here again the essential idea is to have a measure
that gives the same designation to the same degree of brightness
at all ages.

D. The Index of Brightness. Unlike the two previous measures,
the index of brightness, or I.B., turns upon the use of a difference
rather than a quotient. It is the difference of the individual point
score from the mean point score for age added to or subtracted
from 100. If the mean is 110, and two individuals score 90 and 130,
the I.B.’s are 80 and 120. The measure has not been much used.
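
The index of brightness as described here reduces to one line of arithmetic, sketched below in Python. The function name is an assumption made for the example; the figures are those given in the text.

```python
def index_of_brightness(point_score, age_mean_score):
    """I.B.: the deviation of a point score from the age mean,
    added to (or subtracted from) 100."""
    return 100 + (point_score - age_mean_score)


# The example in the text: age mean 110, individual scores 90 and 130.
print(index_of_brightness(90, 110))   # 80
print(index_of_brightness(130, 110))  # 120
```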
The Percent of Average. This is Kuhlmann’s variant on the
Heinis (q.v.) personal constant. It means the percent of average
development as shown on the alleged normal growth curve attained
by an individual at a given age. Three advantages are claimed
for it. (a) It is based on an actual curve of normal growth, unlike
the M.A. and the I.Q. (b) It has a greater statistical stability
than the I.Q. That is, its range of distribution at different ages is
more uniform. A comparison of Tables 10 and 11 will show the
prima facie evidence for this claim. (c) There is a tendency, noted
by P. Cattell (1931) and others, for high I.Q.’s to rise and low
ones to fall. This is said not to be true of the P.A. (v. Hilden).
To this the following replies have been made. (a) The alleged
growth curve is mythological. Its derivation by Heinis (q.v.) is
unsound, and there is doubt whether any general growth curve is
possible. (b) The appearance of greater statistical stability is
illusory. When the fluctuations are considered, not absolutely but
in relation to the magnitude of the scores themselves, the advantage
of the P.A. over the I.Q. vanishes (McNemar, 1942). (c)
There is evidence that low P.A.’s tend to drop and it is probable
that high ones tend to rise, similarly to the I.Q. (P. Cattell, 1933).
Kuhlmann (1939) has offered a rebuttal to the criticism that
since the form of the growth curve is unknown, the P.A. is illusory.
Its essential purpose, he says, is to give the same designation to
the same degree of brightness at all ages. Thus it would still be
valuable even if the very notion of a general growth curve were
admitted to be untenable.

E. Purely Statistical Measures. There is a tendency to prefer
to all the above, purely statistical measures without psychologically
suggestive names. These are based on some unit of dispersion.

(a) The raw scores may be arranged in percentile ranks, as with
the Otis scores shown in Table 15. This tells us what percentage
of all scores is exceeded by any given score. Thus in Table 15
the raw score of 70 is higher than 84% of all scores attained. The
limitation of this method is its distortion of the data. In the table
a raw score difference of 1 between 62 and 61 is a percentile
difference of 3 points. But the raw score difference of 8 points
between 33 and 25 is a percentile difference of less than 1 point.
An illustration will help to show why this happens. Let us suppose
that 20 players are entered in a tennis match, that the middle
10 are very close to each other in ability, and the top 5 and the
bottom 5 quite widely spaced. If the players are ranked in centiles,
a single percentile difference in the middle of the distribution will
stand for a much smaller difference in real ability than at the top
or bottom ends. Also a very small shift in performance among
the middle 10 will yield a percentile shift, while it would take a
considerable shift in performance at either end to change the
percentile ranking. Thus, if we want a measure that will give the
same designation to the same amount of brightness at all levels
and all ages, the percentile is not likely to do so, because it
is affected differently at different levels by changes of brightness.
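
The tennis illustration can be tried out numerically with the following Python sketch. The ability ratings are invented for the purpose and the function name is an assumption; percentile rank is taken, as in the text, to be the percentage of all scores exceeded by a given score.

```python
def percentile_rank(score, scores):
    """Percentage of all scores that the given score exceeds."""
    below = sum(1 for s in scores if s < score)
    return 100.0 * below / len(scores)


# Hypothetical ability ratings for 20 players: the middle 10 are
# bunched closely together, the top 5 and bottom 5 widely spaced.
bottom = [10, 20, 30, 40, 50]
middle = [60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
top = [80, 90, 100, 110, 120]
abilities = bottom + middle + top

# One point of ability in the middle of the group...
gap_middle = percentile_rank(62, abilities) - percentile_rank(61, abilities)
# ...and ten points of ability at the top end.
gap_end = percentile_rank(120, abilities) - percentile_rank(110, abilities)

print(gap_middle, gap_end)  # 5.0 5.0
```

A difference of one point in the crowded middle and a difference of ten points at the sparse top end both come out as the same five percentile points, which is the distortion the text describes.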


(b) To avoid this, measures based on the standard deviatio? 
and generally recommended. Ww 
5. Here the deviations of the se 
; in S.D. units are converted int 
plying them by ro and adding 50 to avo! 
method of deriving his eee 
deviation scoring. He uses the probable
error, which is .6745 of the standard deviation, and defines
90 I.Q. as 1 P.E. below the mean of the age group. Some
such measure, usually the standard score, is the most
stable statistically of all global
units of mental measurement. A given test performance is always
scored in terms of its distance in S.D. units from the mean. If
such and such a test performance is 1 S.D. above the mean at
age 6, and another test performance is 1 S.D. above the mean at
age 12, then both are of equal difficulty for their age groups. The
great stability so obtained may be seen from Table 12, where the
S.D.’s of the Wechsler so-called I.Q.’s are given. In point of stability,
i.e., uniformity of distribution, they are clearly superior to
both Stanford-Binet I.Q.’s and P.A.’s.
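
The standard-deviation measures mentioned in this passage can be sketched in Python as follows. The sketch rests on two readings of the text that should be treated as assumptions: that the standard score is formed by multiplying the deviation in S.D. units by 10 and adding 50, and that the probable-error quotient places 90 at 1 P.E. (.6745 S.D.) below the mean, so that each P.E. counts for 10 points. The function names and sample figures are invented for the example.

```python
PROBABLE_ERROR = 0.6745  # one probable error, expressed in S.D. units


def z_score(raw, age_mean, age_sd):
    """Distance of a raw score from the age-group mean, in S.D. units."""
    if age_sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (raw - age_mean) / age_sd


def standard_score(z):
    """Standard score: S.D. units multiplied by 10, plus 50."""
    return 50 + 10 * z


def pe_quotient(z):
    """Deviation quotient on which 1 P.E. below the mean reads 90."""
    return 100 + 10 * (z / PROBABLE_ERROR)


# Hypothetical norms: the same relative standing at two different ages.
z_age_6 = z_score(raw=30, age_mean=24, age_sd=6)     # 1 S.D. above the mean
z_age_12 = z_score(raw=75, age_mean=60, age_sd=15)   # 1 S.D. above the mean

print(standard_score(z_age_6), standard_score(z_age_12))  # 60.0 60.0
print(pe_quotient(-PROBABLE_ERROR))                       # 90.0
```

Because both figures are computed from the distribution of the person's own age group, equal relative standing receives the same designation at every age, whatever the raw scores happen to be.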

McNemar (1942) and Terman and Merrill (1937 b) have nevertheless
defended the I.Q. as preferable to the standard score.
For this they give two reasons. First, teachers and parents and
others understand the I.Q. and the M.A., but do not well understand
the standard score and cannot interpret it. This may be
true. But it has been urged in reply that persons who are concerned
with tests and test results should be educated to understand
the best available measures, and that so-called lay understanding
of the I.Q. consists of a good deal of misunderstanding.
The second point made by McNemar and Terman and Merrill is
that the I.Q. is itself in effect a standard deviation score, since
it, too, indicates the divergence of an individual test performance
from the age norm in terms of the age distribution. However, this
is only approximately true, as we have seen; and there seems no
good reason for preferring a measure somewhat inaccurately expressing
deviation from the mean and weighted with a name full
of implications to a measure which is entirely accurate in this
respect and without any psychological suggestions or overtones
whatsoever.

The practical effect of all these purely statistical measures is
to throw more of the burden of interpretation upon the person
who uses the test and its results. This is just as it should be. A
test score should be a stable measure with clearly defined charac-
teristics, i.e., as indicating brightness or maturity. It should show
as unambiguously as possible the standing of any individual rela-
tive to the ability defined and embodied in the test. It should
avoid forcing interpretations which might be valid for the stand-
ardization group but not for other groups or individuals. It should,
in other words, tell as directly as may be a factual story. It need
not, for this reason, lack psychological content and significance.
If we know that a person is 2.5 S.D. above the mean on a good
intelligence test, this is a more accurate piece of information than
knowing that his I.Q. is 135, because this S.D. deviation is directly
and without question comparable to all others of the same amount
at other age levels, whereas an I.Q. of 135 at 8 is only approxi-
mately comparable to an I.Q. of 135 at 12. Everything expressed
or implied in the I.Q. score is contained and implied in the S.D.
score, and it is more precisely expressed.

The only reservation to make in connection with standard
scoring is that unless the distributions are approximately normal
it becomes seriously misleading. The magnitude or range of the
distribution, so long as it remains the same, does not affect the
situation, but its form does. In the case of the best-constructed
tests there is fairly good reason to believe that the distributions
are approximately normal, although conclusive proof cannot
be had. The caution, however, is well worth observing, since there
are plenty of tests on the market which are neither well con-
structed nor competently analyzed, and whose authors seem to
think that a normal distribution is to be assumed as a consequence
of natural law instead of something for which evidence should be
forthcoming.
F. The Accomplishment Quotient. The accomplishment quo-
tient is the ratio of educational age to mental age, or of edu-
cational quotient to intelligence quotient, the idea being
formulated by Franzen (q.v.). It is intended to measure effort or
motivation. If educational age is less than mental age, the A.Q.
is less than 100. If it is equal to mental age, the A.Q. is 100. If it
is greater the A.Q. is above 100 (v. Stebbins and Pechstein;
Cureton). It must not, however, be thought of as a ratio between
pure intelligence on the one hand, and pure achievement on the
other, for intelligence tests and achievement tests have much in
common. Rather, it is a ratio between a less and a more general
measure. It has the following disadvantages. (a) It tends to penal-
ize the bright pupil, who cannot work “up to capacity” as readily
as the dull pupil. (b) Since it is derived from two tests, one men-
tal and the other educational, it has a lower reliability than either
(v. Chapman). (c) It will not be a valid measure unless the
distributions of the two tests are approximately equal, for only
then will a given E.Q. (say, 120) be comparable with the nomi-
nally corresponding I.Q. of 120. Unless the two distributions are
the same, the two seemingly identical figures will have different
real values (v. Freeman, 1939). As a formal instrumentality it
cannot be recommended. But the same idea can be applied in the
ability standard technique (v. Symonds, 1927). Mean scores on
educational tests can be computed for various intelligence level
groups, and these can be used as standards for pupils whose intelli-
gence ratings are known.
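The definition of the A.Q., and disadvantage (c) in particular, can be made concrete with a short calculation. This is a minimal sketch in Python. The function names are hypothetical, and the two standard deviations (12 and 16) are invented figures chosen only to show what happens when the distributions differ.

```python
def accomplishment_quotient(educational_age, mental_age):
    """A.Q. = 100 x educational age / mental age (ages in the same unit)."""
    return 100 * educational_age / mental_age

def quotient_in_sd_units(quotient, sd, mean=100.0):
    """Express a quotient as a deviation from the mean in S.D. units."""
    return (quotient - mean) / sd

# Educational age below, equal to, and above mental age.
below = accomplishment_quotient(9.0, 10.0)
equal = accomplishment_quotient(10.0, 10.0)
above = accomplishment_quotient(11.0, 10.0)
print(below, equal, above)

# Disadvantage (c): the same nominal figure of 120 on two scales whose
# distributions differ (hypothetical S.D.s of 12 and 16).
eq_standing = quotient_in_sd_units(120, sd=12)
iq_standing = quotient_in_sd_units(120, sd=16)
print(round(eq_standing, 2), round(iq_standing, 2))
```

With unequal spreads, an E.Q. of 120 and an I.Q. of 120 mark different relative standings, so a ratio of 100 between them would not mean that achievement matches capacity.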


2. Profile Scoring 


As we have seen, there is today an increasing criticism of global
scores on the ground that they conceal relevant differences under
average ratings, and thus obscure diagnosis and analysis. The
alternative is the profile rating, which presents a pattern of scores
on separate traits or functions, with or without an average or
over-all global rating as well. Seashore, for example, definitely
objects to the averaging of scores on the six subtests of his bat-
tery, and claims that results should always be reported as a pro-
file of pitch discrimination, time discrimination, loudness dis-
crimination, timbre discrimination, rhythm recognition, and tonal
memory. The separate scores on which the profile is constructed
are usually computed on the basis of percentile ranks or standard
deviations. In the above case the percentile method is used. The
comments already made on these methods apply here and will not
be repeated.
In all profile rating the key question is this: How are the sepa-
rate traits or abilities which the profile represents conceived and
defined? If they are ill-conceived, ill-defined, and so not authentic,
the profile means nothing at all. It is indeed worse than useless,
for it becomes highly misleading. A few illustrative references
will make the issue clear.
The six traits embodied in the six subtests of the Seashore bat-
tery are without doubt defined both in words and in terms of test
items with great precision. One may fairly ask whether they really
are components of musical talent and whether the subtests are
reliable enough to yield a stable profile when the differences are
small. But with these reservations, the obtained profile has an
exact and indubitable meaning. In the same way the Minnesota
Multiphasic Personality Inventory is capable of establishing sig-
nificant and important diagnostic differentiations for most, though
perhaps not for all, of its different scales, because of their well-
defined meaning, based on psychiatric theory, and their careful
embodiment in test items. The same cannot be said of the
Bernreuter Personality Inventory, whose traits (introversion-
extroversion, neurotic tendency, dominance-submission, self-
sufficiency, self-confidence, and sociability) are unstable,
indefinite, and dubious. A three-class profile based on the
Detroit General Aptitudes Examination, to show general
intelligence, clerical aptitude, and mechanical aptitude,
would almost certainly have little meaning, because the functions
are not well clarified and defined either verbally or in the test
items.


Among intelligence tests the Wechsler-Bellevue scale estab-
lishes an effective and meaningful two-class profile between per-
formance-test intelligence and verbal-test intelligence, and profiles
based on its 11 separate subtests are found by Rapaport to pos-
sess considerable diagnostic efficiency. The latest editions of the
American Council on Education ... purport to establish two-class
profiles ... ability, though ... so far as ... Primary Mental ...
this direction in ... ability, space-visualizing ability, ... memory,
induction, and deduction. Considerable ... and intricate research
devoted to ... spatial relationships, reasoning, ... grave question.
As Kuhlmann says, ... proposition of showing that a child has a
“good memory” is one ... cause very considerable hesitation. As to
the Detroit Tests of Learning Aptitude there can ... any underlying
framework of well-analyzed, well-defined, significant psychological
concepts. Even global scores will be better than that!
STANDARDIZATION


As shown repeatedly in these pages, the very essence of psy-
chometric procedure turns upon the interpretation of a given
individual’s test performance in terms of the performance of a
population or universe of discourse. But since the entire popula-
tion can very rarely be tested, recourse must be had to a sample,
known as a standardization group. Mental tests never measure
any psychological function or ability directly, but always by set-
ting up a behavior pattern believed to embody it, and instituting
a comparison between that behavior as manifested by the indi-
vidual subject and by the population, through the mediation of a
standardization group. This fact, that mental measurement is
always relative or comparative, is nothing against it. A great many
such comparative judgments are constantly made in everyday life,
and indeed in science. If the average length of the noses of two
racial groups differed by one centimeter, it would be a very strik-
ing phenomenon. But if there were one centimeter of mean dif-
ference in their standing height, it would be far less noticeable
and presumably less important. One centimeter, or one milligram,
or one degree of temperature centigrade have very different mean-
ings in different situations; and judgments, both common-sense
and scientific, always depend upon the interpretation of obtained
differences comparatively or relatively to their background. There
is, therefore, no objection on a logical basis to this aspect of psy-
chometric procedure.

But experience and research have brought to the foreground
many problems in the actual application of the logic of compara-
tive judgment in the field of mental measurement. That the stand-
ardization group must be a representative and adequate sample
goes without saying. For this the primary consideration is not its
size but its selection. And the selection of a true and unbiased
sample of human beings with respect to some trait is exceedingly
difficult. Indeed, in virtually all test construction, even the very
best, there are flaws in the selection of the standardization group.
But when the work is done by the best and most careful and
thorough methods, the approximation is good enough for satis-
factory practice. The improvement of sampling techniques is prob-
ably not a major point in the evolution of better tests. At any rate,
it is not a major point in current psychometric research.
A much more immediately cogent question is: Against just what
population is a test being projected and its interpretive norms
developed? One might say that if a test is to measure intelligence,

i r of inductive reasoning, or schizoi 
tendency, it should surely contemplate these functions in the 
e Supposed to be general or universal 
ch means that they may be mani- 
1e degree. This, in a sense, no doubt, is true. 
Surely, then, it is necessary to compare or rank or rate the indi- 
vidual against his universal human manifestation. And if so, the 
only adequate standardization sroup will be a true sample of the 


dreams of such universal tests, as 
point out (g.v.) But in the main they 
functions it is Possible to measure on 
scale are the simple sensori-motor processes, 
ne or intensity discrimination. And even puen 
i i “very existing mental test is 
xplicitly selected pan 
hosen accordingly, Thus the 
andardized in terms of a popu- 


applied to Negroes, OF 
Cans, or to some very 
goes beyond its contemplate 
» and there is always the question 0! 
€ interpretations. 
o develop special group norms 
assifications or school grades 
r, is exceedingly difficult 
rge scale, although there hav 
» Such as dual standardization 
on city and country children. Also, there is the question of whether
we want to compare an individual with his own particular limited
group or against a wider setting. Shall we compare Negroes
ural children, and so on? OF 
er criterion Stoup, including rural children, 


en of Professional, and business, and skilled, 
and unskilled laboring parents in about the proportion of these
classes in the general population? Which will yield the more
meaningful interpretations?


application, Strictly Speaking 
whether its norms mg 
This final question itself implies the answer. It depends upon
the purpose for which the test is intended. On the one hand, there
has been a development towards the special purpose mental test.
Army Group Intelligence Examination Alpha itself, in its original
form, was in effect such a test although not a highly evolved
example of the type. The University of Iowa Placement Tests are
a better example. One cannot say that they measure simply aca-
demic aptitude, still less “University of Iowa aptitude.” What
they do measure is intelligence as it functions in the curricular
setting of the University of Iowa. Therefore, it is proper and
desirable to standardize them on a sample of this intelligence. So, if
we think of the census group instead of the racial group, and want
to measure the inductive reasoning ability of children of Welsh
descent in Wilkes-Barre, there ought to be special interpretive
norms for this purpose. It may be perfectly proper to use a test
more broadly standardized if a broader comparison is desired, but
the true meaning of the scores must never for a moment be for-
gotten. Thus one tendency is towards standardization in terms of
the special functional group for which the test is intended. And a
functional group, be it noted, is not a mere classification, such as
that of all persons earning between $5,000 and $10,000 a year. It
is a group brought together by some community of purpose which
makes it significant, such as candidacy for medical school, or for
a given college, or perhaps life on a canal boat. This practice of
standardizing tests on defined functional populations is a distinct
and growing development.
On the other hand, interest may center upon the mental process
itself, and then the more inclusive its representation the better.
To standardize a test of primary mental abilities on a limited
functional group would certainly be questionable. Limits there
will undoubtedly be. One would presumably not include Hotten-
tots or Eskimos in the sample, unless experimentally. But the
wider those limits are, the sounder the logic would appear.
Which of these two trends contains the greater promise is an
open question. Both are present today, and both no doubt will
continue. The important thing is to understand them, and to see
them as involving a quest for more intelligent and discriminating
procedures of standardization and interpretation in the light of
the purposes contemplated.
NEW OPERATING CONCEPTS


The quest for new operating concepts around which tests can 
be constructed is today being followed along two widely different 
lines. The first is the realistic study of the function to be measured 
as it appears in some actual setting, this being an extension of 
the idea of the special purpose psychological test. The second is 


the technique, or rather the body of techniques, known as factor 
analysis. 


1. The Study of Function


The practice of building tests around a realistic study of the 
function to be measured is, of course, not new. But it has been 
excellently exemplified in much of the psychometric work done 
in the United States Army during World War II. It seems very 
probable that this work, like that done in World War I, will set 
the patterns for many future developments. The tests themselves, 


being intended for military uses, for the most part may not be em-
ployed widely, as was Army Group Intelligence Examination
Alpha. For this reason they have not been described at length in
this book. Instances of them are considered here, not for their own
sake, but to illustrate the basic principle of their construction,
which is likely to find extensive application.


A good example is the development of a test for radiotelegraph
operators (v. Staff, Personnel Research Section, Classification and
Replacement Branch, Adjutant General’s Office, 1943 a). A survey
of the situation revealed that the most common cause of failure
among radiotelegraph operators in training was the failure to
learn code. Accordingly, two chief tests were constructed. The
first was the Radiotelegraph Operator Aptitude Test. It consists
of items in general like those of the Rhythm Test in the Seashore
Measures of Musical Talent. Pairs of code patterns are presented
aurally. The series increases in difficulty. The second member of
each pair is to be judged the same as or different from the first.
The second test was the Code Learning Test. This consists of a
30-minute learning period centering upon 6 code characters. After
this there is a test with the 6 practiced characters presented 2
times each as part of a total set of 40, the additional 28 consisting
of unpracticed code characters. The task is to discriminate the
practiced characters. Both tests were found to correlate well with
the ability to learn code up to a given speed as measured by time
required.

Another instance is the Aviation Cadet Qualifying Examination 
(v. Staff of the Psychological Branch, Office of the Air Surgeon, 
1945). This test was developed through a long series of refine- 
ments. A vocabulary subtest was considered but rejected. In final 
form it consists of the following subtests: Pilot Aviation Interests, 
Avocational Interests Related to Aviation, Driving Information, 
Hidden Figures (a spatial perception test), Mechanical Compre- 
hension (tracing and understanding mechanical relationships). 
The best of these, in terms of correlation with training success, 
were Mechanical Comprehension and the two interest tests. 

Another instance of test development is the construction of a 
battery for identifying potentially successful naval electrical engi- 
neers (Lawsche and Thornton). The final battery which emerged 
out of a series of experimental attempts, involving correlation 
with the criterion of achievement in the training course, consisted 
of a test of ability to read simple measurements and to solve 
arithmetical problems, another on electrical information, and a 
third revealing general alertness. The above was the order of
predictive efficiency against the criterion of training course grade
points.

Even a recorded failure to construct a valid test is not without
its instructive aspect. The attempt was made to build a test for
the selection of truck drivers. A number of visual and motor psy-
chophysical tests were tried out, but had little selective value in
terms of the criterion. The same was true of an experimental
driver-information test. The general conclusion was that the best
test was simply a tryout on the road (v. Staff, Personnel Research
Section, Classification and Replacement Branch, Adjutant Gen-
eral’s Office, 1943 b). It would seem that in this case the criterion
was so elusive, and the operative conditions of success so in-
accessible, that no test could be built.

The ... and immediacy of the criterion, and
the clarity with which it was possible to formulate the basic con-
cept embodied in the test would seem to explain two ... suc-
cessful features of mental measurement during World War II.
The first of these is the wide use of brief tests for ... pur-
poses, which has been mentioned elsewhere in this book. As Guil-
ford (1946) and others have pointed out, when a very well-
defined process of selection was involved, or when it was desired
to make sure that certain minimal requirements were met, con-
ventionally accepted standards of reliability and validity were
unnecessary. The second notable success was in the field of per-
sonality measurement. We have seen that, with a very few excep-
tions, personality tests have not commended themselves highly
to psychologists for general use. But during the war they were
found very serviceable (Hunt and Stevenson, 1946). The reason
was the simplicity of the problem. That is to say, there was a very
close relationship between various pathological manifestations
and unfitness for service. Somnambulism, for example, may not
usually be very serious in civilian life, but in war it can be disas-
trous. So, too, for many other personality and behavior disturb-
ances which can be indicated successfully in a relatively short
and simple test.


Aircraft pilot train 
use of analytic tec 
effective tests Guil 
that the 2r scores 
only eight of the 


ee selection is an outstanding example of tre 
hniques ‘as a basis for the construction n 
as follows: “It was foun 
ification battery measure 


to be positively loaded 7 
the pilot-training criterion All of these factors, incidentally, a 
foreign to the usual intelligence test 


- The use of intelligence ioe 
8 those whose I.Q.’s are above 10° 
rom the estimated factor loatling: 
s in the pilot criterion, it could be pect 
that these factors, opti ighted in the test composite, wou 
yield a validity of about -60 for that composite, This was not far 
d. From results with experi- 

at there were nine other factors 
having positive loadings with the pilot criterion. Had the classifi- 
cation battery included them, Properly Weighted, the validity ° 

about .70. There were four other 
factors in which the Pilot criteri 


. ilot 

appeared to have substantial TEk 
these considerations. New factor 

were still undisclosed but indicated before the end of the war- 


F ai to 
‘ » the 21 factors with some gee 
recognition in the pilot criterion would orcinarily be ca 


R x : u De Pee, 
abilities. Whatever variances were contributed to the criterion b . 
temperamental factors were almost untouched. The conclusio 
be that the upper limit of validity for any battery is an un-
fest a quantity Any estimate of it needs to be liberal and sub- 
the = mma as new factors come into the picture. Incidentally, 
Eo um vor of human factors, when they are much better known, 
= Probably run much larger than has been supposed. The hori- 
P Bae is slowly but surely extending beyond the con- 
Stan os the 1.Q. It is hoped that the horizon of temperament will 
PO „grow beyond the concepts of neurotic tendency and the 
9. (P. 434). 
eee pga may be ventured that the outstanding single out- 
eal of the experience of measurement in World War II is the 
sunk ote that psychometric advance does not depend upon 
<< ficial devices or quasi-clerical improvements, but upon a 
len tee more practicable, and better analyzed notion of what is 
eas mganred. In other words, as Jenkins (g.v.) puts it n a 
Se me fo which reference has already been made, the outstanding 
is the need for better analytic techniques for validation. 


2. Factor Analysis

Factor analysis is a technique or group of techniques which is
coming more and more into prominence. There have been numer-
ous references to its application in these pages, and although no
comprehensive account is offered here, a general appraisal is in
order. It is essentially a method of clarifying and defining basic
concepts, not indeed inconsistent with more empirical and trial-
and-error procedures, but introducing new controls and refine-
ments. It will, for instance, be remembered that the California
Test of Personality is built on what is essentially an empirical,
though very carefully considered, classification of types of adjust-
ment, because the factorial studies undertaken by the authors had
... foundation for test construction ... personality tests. These
instances ... the place and significance of ... general picture of
psychometrics.


Factor analysis is essentially a process of simplification. It is
a mathematical technique, or array of techniques, by which a large
heterogeneous set of measurements can be expressed in terms of
a few simple concepts. The idea applies far beyond psychology.
For instance, in measuring and plotting the surface of the earth,
it is possible to express what would otherwise be a baffling array of
figures in terms of the two “factors” of latitude and longitude


measurements, i.e., measurement 
that all the relationships among t. 
length, girth, head size, and size ctr 


© . 
à ’S motivation, or his reading ability, 


dge of psychology. These we might 
mp them all together, and call them 
performance, Then there would be 
... requiring discrimina-
tion of space relationships. Does a single mental process exem-
plify itself in all these types of performance? Is there anything
in common between verbal and arithmetical reasoning? If so,
what is it? Or between vocabulary and the power to understand
a difficult paragraph? Or between the task of telling what a sheet
of paper will look like when unfolded after having been folded
and cut, and the task of running a maze? Does the same ability
exemplify itself in arithmetical computation and in problem solv-
ing? Does aesthetic discrimination enter in with intelligence tests
that use pictorial material? It is to such questions as these that
factor analysis is directed.
The general procedure is an extension of the correlational tech-
nique. A correlation matrix, such as that in Figure 32, is set up,
showing the intercorrelations of a number of tests. When the
unreliability of the tests concerned is allowed for, the coefficients
show the true relationship or “commonality” between the various
tests. Some have much in common, some have less, some have
very little. The tests that have high intercorrelations are pre-
sumably measuring the same thing to a considerable extent, or
are “saturated” with the same factor to a considerable extent.
Those with low intercorrelations embody different factors in the
main. But all have at least something in common. On these
assumptions mathematical analysis is applied to determine how
many factors are needed to account for the observed relationships,
and also to determine as far as possible what these factors are.
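The text does not say how the unreliability of the tests is allowed for. One classical way of doing so is Spearman's correction for attenuation, in which an observed correlation is divided by the square root of the product of the two reliability coefficients. The following is a minimal sketch in Python under that assumption; the function name and the figures used are hypothetical.

```python
from math import sqrt

def correct_for_attenuation(r_xy, reliability_x, reliability_y):
    """Estimate the correlation between two tests if both were perfectly
    reliable: r_xy / sqrt(r_xx * r_yy)."""
    if not (0 < reliability_x <= 1 and 0 < reliability_y <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / sqrt(reliability_x * reliability_y)

# Hypothetical figures: an observed correlation of .48 between two tests
# whose reliabilities are .80 and .90.
observed = 0.48
corrected = correct_for_attenuation(observed, 0.80, 0.90)
print(round(observed, 2), round(corrected, 3))
```

The corrected coefficient is never smaller than the observed one, since unreliability can only lower an observed correlation.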


                  Opposites  Completion  Memory  Discrimination  Cancellation

Opposites ......     --         .80        .60        .30           .30
Completion .....     .80        --         .48        .24           .24
Memory .........     .60        .48        --         .18           .18
Discrimination .     .30        .24        .18        --            .09
Cancellation ...     .30        .24        .18        .09           --


FIG. 32. HYPOTHETICAL CORRELATION MATRIX

(Spearman, 1927)
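The special property of the hypothetical matrix in Figure 32 can be checked by direct calculation. The sketch below, in Python, uses the coefficients as read from the figure, with illegible cells filled in by symmetry. It applies Spearman's tetrad-difference criterion, which is not named in the surviving text here and is brought in as an illustration. All function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

TESTS = ["Opposites", "Completion", "Memory", "Discrimination", "Cancellation"]

# Upper triangle of the hypothetical matrix in Figure 32.
R = {
    ("Opposites", "Completion"): 0.80,
    ("Opposites", "Memory"): 0.60,
    ("Opposites", "Discrimination"): 0.30,
    ("Opposites", "Cancellation"): 0.30,
    ("Completion", "Memory"): 0.48,
    ("Completion", "Discrimination"): 0.24,
    ("Completion", "Cancellation"): 0.24,
    ("Memory", "Discrimination"): 0.18,
    ("Memory", "Cancellation"): 0.18,
    ("Discrimination", "Cancellation"): 0.09,
}

def r(a, b):
    """Correlation between tests a and b, in either order."""
    return R[(a, b)] if (a, b) in R else R[(b, a)]

def tetrad_differences():
    """Two independent tetrad differences for every set of four tests."""
    diffs = []
    for a, b, c, d in combinations(TESTS, 4):
        diffs.append(r(a, b) * r(c, d) - r(a, c) * r(b, d))
        diffs.append(r(a, b) * r(c, d) - r(a, d) * r(b, c))
    return diffs

def general_loading(test):
    """Loading on a single common factor from one triad of tests:
    sqrt(r_ab * r_ac / r_bc)."""
    b, c = [t for t in TESTS if t != test][:2]
    return sqrt(r(test, b) * r(test, c) / r(b, c))

largest = max(abs(d) for d in tetrad_differences())
loadings = {t: general_loading(t) for t in TESTS}
print(f"largest tetrad difference: {largest:.6f}")
for name, value in loadings.items():
    print(f"{name:15s} {value:.2f}")
```

When every tetrad difference vanishes, as it does for these figures, each coefficient is simply the product of two loadings, so that a single common factor is enough to reproduce the whole table.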
Spearman regards this as in substance a proof that completions,
arithmetic, vocabulary, and directions exemplify a general factor 
(v. Spearman, 1927 b). 

Besides, the general factor involves factors of two other kinds. 
The first are those unique to it, i.e., the “special” factors peculiar 
to solving geometrical problems, or to understanding prose, but 
not common to both. Second are the “group” factors common to 
a number of performances, but not so universal as G. For instance, 
there is much in common between cancelling A’s and E’s in a 
prose passage that would not appear in handling geometrical 
problems. Thus, for Spearman all mental performance is due to
the general factor, certain special factors, and certain group fac- 
tors. Such, according to him, is the organization of intellect. 

(b) The view most sharply opposed to this is that of Thurstone. 
In his monumental work published in 1938, he gave over fifty 
tests to some 250 undergraduates, and computed about 1,500 
correlations. These correlations he explained in terms of a number 
of more or less independent factors which he called primary
mental abilities, among them being number facility, word fluency,
visualization of space, memory for words, perceptual speed, verbal
reasoning, and induction. In his later work he has modified this
list somewhat and further defined some of the primary abilities
(v. Thurstone and Thurstone, 1941; Thurstone, 1940). Another
set of factors includes a general factor, a mathematical-mechanical
factor, a verbality factor, a spatial relations factor, a memory
factor, a mental speed factor, a deductive reasoning factor, and
a motor speed factor (Holzinger). In Thurstone’s earlier work the
general factor, or G, does not appear. Some years ago, however,
Spearman (1939) argued that this is due to his statistical proce-
dures, and that if Thurstone’s data are handled by a different
method, a true general factor appears in them. Since then Thur-
stone (1944) has published work demonstrating what are known
as second-order factors. First-order factors are derived from and
involved in the test correlations themselves. Second-order factors
are derived from and involved in the first-order factors. Thur-
stone provides an interesting and instructive illustration of the
idea. Let us suppose, he says, that we have a number of rectangu-
lar boxes of many different sizes, and also a set of measurements
for each: the diagonal of the front, the area of the top, the length
of the vertical edge, etc. These measurements would correspond to
the test scores, and each box to an individual for whom scores
had been obtained. When the various sets of scores had been
correlated, and the coefficients arranged as a matrix, this matrix
could be factored into three primary factors, namely, length,
width, and height. Then these three primary factors could again
be factored into a single secondary factor, which would be called
the size factor. Thurstone has expressed the view that what Spear-
man called the general factor may reappear as a secondary
factor.
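The second step of the box illustration can be sketched with simulated figures. The following Python sketch is illustrative only and does not reproduce Thurstone's data or procedure. It assumes that length, width, and height (in standardized units) each consist of a shared "size" component plus independent variation, takes the three primary dimensions as given rather than extracting them from a matrix of measurements, and estimates the loading of each on a single second-order factor from the correlations among them. All names and the simulation model are hypothetical.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def simulate_boxes(n, seed=0):
    """Boxes whose three dimensions share a common size component
    (an assumed model, in standardized units)."""
    rng = random.Random(seed)
    boxes = []
    for _ in range(n):
        size = rng.gauss(0, 1)            # overall size of the box
        length = size + rng.gauss(0, 1)   # size plus independent variation
        width = size + rng.gauss(0, 1)
        height = size + rng.gauss(0, 1)
        boxes.append((length, width, height))
    return boxes

def second_order_loadings(boxes):
    """Loadings of the three primaries on one common factor, using the
    triad formula sqrt(r_ab * r_ac / r_bc)."""
    length, width, height = zip(*boxes)
    r_lw = pearson(length, width)
    r_lh = pearson(length, height)
    r_wh = pearson(width, height)
    return {
        "length": sqrt(r_lw * r_lh / r_wh),
        "width": sqrt(r_lw * r_wh / r_lh),
        "height": sqrt(r_lh * r_wh / r_lw),
    }

size_loadings = second_order_loadings(simulate_boxes(2000))
for name, value in size_loadings.items():
    print(f"{name:7s} loading on size factor: {value:.2f}")
```

Under the assumed model each dimension has a population loading of about .71 on the size component, and the sample estimates should fall near that figure. The primaries are correlated with one another only because they share this component, which is what allows them to be factored in turn.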


er, remark that this brief sketch 
notion of factor analysis. Its techniques 
actively applied. New results are constantly 


ng workers change their opini hat- 
opinions. But w. 
ever the account offered, the thing that koia be understood iS 


to each factor, and finally to reconstruct 
» So far as possible, “factorially 


and, it is hoped, better operatin 
Factor analysts criticize 


rationale or organizing id
whose psychological cont
vealed in rapid and accurate calculation, Verbal ability revealed
in verbal comprehension, Spatial ability revealed in the imaginary
manipulation of spatial forms, Word fluency revealed in producing
numerous words in a given time, Reasoning ability revealed
in finding relationships in presented material, and Memory ability
revealed in rote memory. How these constituent factors are set
up in the tests will be seen in Figure 22. This is the outcome of
the continuing work of Thurstone, which led to the conclusion
that the six factors named were clearly enough isolated to be used
in test construction (Thurstone and Thurstone, 1941). Another
practical example is the work of Flanagan (q.v.) with the
Bernreuter Personality Inventory. He found it possible to reduce the
original four trait divisions (Neurotic tendency,
Introversion-extraversion, Dominance-submission, Self-sufficiency) into two,
namely, sociality and self-confidence. These have been built into
the scoring key, and can be used in place of the other four. This
is an excellent instance of what is involved in factor analysis.
Yet another example is the application of factor analysis to the
Strong Vocational Interest Blank for Men. This yielded five basic
interest types, to wit: Interest in people, Business, Intellectual
activities, Science, and Language (Strong, 1934, 1943; see also
Thurstone, 1931). Yet another instance is the California Tests of
Mental Maturity, which purport to measure Memory, Spatial
relationships, and Reasoning. These are said to have been arrived at
by means of factor analysis, and to be the basic psychological
components of the battery. Spearman has not undertaken to
construct a test battery for the measurement of G, so nobody can say
what it would be like. However, Stoddard (1943) has suggested
that it might well resemble the Stanford-Binet scale with certain
modifications.

For a general appraisal of the over-all significance of factor
analysis, the most crucial question that can be asked is this: Just
what are the factors that analysis discovers? Specifically, are they
akin to actual psychological entities, or causes, or forces, as
faculties and instincts were once supposed to be? That a
correlation matrix can be factored is a mathematical truth. But this in
itself tells us nothing about what the obtained factors correspond
to. This is a great deal more than a purely theoretical question. It
happens that some counselors administer a factorial test, such
as the Thurstone Primary Mental Abilities battery, work out the
indicated factor profile, and then give very confident vocational
advice on the result, apparently in the belief that they are
proceeding on a “scientific basis.”
Eoo a ae factor analysis has at least a cage 
er he to a return to the older faculty psychology. a 
on 4) has remarked that many of the factors that are beng 
pe cies: are given almost exactly the same names that ag A 
ago attributed to his mental faculties. Also, some very
distinguished workers in the field have in the past declared, almost
in so many words, that factors are mental faculties under another
name, the only difference being that faculties were arrived at a
priori while factors are arrived at by dint of statistical analysis.
Yet such a viewpoint seems hard to defend. As to the predictive
value of a factor pattern, it has certainly not uniformly or
universally proved of significance. In discussing the
Primary Mental Abilities battery, and other similar instruments,
it was pointed out that they do not seem any more closely related
to the usual criteria, e.g., success in school, than the familiar
“global” tests. So, too, Goodman (q.v.) and Ellison and Edgerton
(q.v.) find little relationship between factor scores and
achievement in school subjects that might be thought likely to be
associated with them. Also, Stuit and Hudson (q.v.) and Adkins (q.v.)
report that the factor patterns revealed by the Primary Mental
Abilities battery have little relationship to vocational fitness and
choice. If the factors that are being


operational constituents of the huma 
genuine “faculties”—then surely a 
would have a decisively greater pr 
global score, let us sa 


Ppear in them at all, or only very 
meagerly, 


What, then, it may be asked, is the value of factor analysis?
The answer does not seem difficult. Any legitimate simplification,
any rationalization, any ordering of a field of complex data has
many advantages, both theoretical and practical. When we derive
a measure of central tendency, or a measure of relationship, we
find out something valuable about our data, considered as a field
of order. When we show that our data can be handled in terms of
a few clear-cut concepts, something of great importance has been
achieved. We need not believe that an average “exists,” or that a
correlation “exists.” They are merely conceptual tools. But this
does not detract from their importance both for thought and
practice. So also with mental factors. As Burt (1941, p. 97) puts it:

“What distinguishes factor analysis, therefore, from other ways of
discovering how individuals and their numerous attributes can best
be classified, is chiefly this: whereas the ancient logician reached
his definitions by examining the meanings of words, the modern
pntielg reaches his classifications by examining the correlations 

between forms of behavior to which these words very loosely refer.
But the ulterior object is still the same; and, whether we are
describing persons or traits, the factorial concepts adopted are
simply principles of classification.”


TAPPING NEW PSYCHOLOGICAL PROCESSES

A development of major importance in recent years has been
the increasing use and investigation of projective tests, which are
directed towards psychological processes touched only slightly,
unsatisfactorily, or not at all by psychometric instruments.

Students of projective testing tend to express themselves in a
very polemical fashion, to make extremely sweeping claims, and
to cast much disparagement upon psychometric methods
generally. Thus Klopfer and Kelley (q.v., p. 13) say: “Out of the need
to bridge the gap between merely subjective understanding of
the personality gained through clinical observation, and the
objective measurement of individual differences with little or no
understanding of their origin or deeper meaning, there developed
an approach which may be described, as in the above quotation,
by the term ‘projective methods of personality diagnosis’.”
Rapaport, Gill, and Schaefer (q.v.) intimate in effect that mental
testing has been held in a strait jacket largely by the prestige
of the Binet scale, and contrast the work of the “I.Q. testers”
with projective procedures to the disadvantage of the former.
And Sargeant (q.v.) finds in the new development nothing
less than a shift from mechanistic to dynamic and
holistic psychological interpretations. Perhaps such extreme
contentions are understandable in workers in a relatively strange and
new field which has not yet gained complete recognition and is
subject to many misunderstandings. How justifiable they are
we shall have to inquire after considering three of the most
important projective instruments. But there is no doubt that projective
methods are highly significant in opening up for investigation
areas of mental life with which psychometric methods have not
adequately dealt.

As to the concept of projection, Rapaport, Gill, and Schaefer 
write: “In this sense a projection has occurred when the psycho- 
logical structure of the subject becomes palpable in his actions, 
choices, products, and creations” (Vol. II, p. 7). The human
mind, they point out, does not merely receive impressions, but
always reacts to them in terms of its own characteristics. Thus
any reaction, even that of simple perception or association, is
indicative of the personality, and may be considered a projection 
of it. Any reaction, that is, is determined not only by the object 
reacted to, but by the subject who reacts, and the characteristics 
of the subject are more or less revealed in it. The creative work 
of an artist is a projection and revelation of himself. So are the 
responses of a subject when he is asked to give free associations 
to a list of stimulus words, or to tell what story is suggested to 
him by a picture, or to say what he sees in cloud shapes or ink 
blots. This idea is the working basis of projective testing. 

Projection takes place in a vast number of varied situations, a
great many of which have been used, with varying degrees of
success, for clinical and diagnostic purposes. J. E. Bell (q.v.) in his
extremely thorough account of the field lists storytelling, the
interpretation of cloud pictures, the expression of likes and dislikes
for photographs of faces, the analysis of handwriting, drawing,
and painting, finger painting, picture completion, and vocal
expression among some of its manifestations which have been found
significant. All this in addition to the well-known and widely used
projective tests.

The instruments themselves are techniques for eliciting, observing,
recording, and interpreting projective responses. A stimulus
situation is set up which is as economical as possible in the sense
of not being time-consuming, as impersonal as possible, standard
for all subjects, and limited to one segment of behavior. The
examiner observes and records the subject’s responses, and
interprets them partly in the light of his clinical experience and partly
with reference to an organized body of interpretive data. Three
such instruments will now be briefly described.

1. Word Association Test *

This is the oldest of well-known projective tests. The first
report of its uses was made by Jung in 1919. It consists
of a list of stimulus words which are presented orally to the
subject with the instruction to respond by saying the first word
that occurs to him. Jung originally used a list of 100 stimulus
words. This list was revised by Rosanoff. Rapaport, Gill, and
Schaefer have made a further revision, reducing the number to
60. The basis of choice was in favor of words that would tap many
areas of ideation, conflict, and maladjustment. Many of the words
have familial, domestic, oral, anal, aggressive, and sexual
connotations. Also, many of them are nouns. The content and speed
and the general emotive characteristics of the subject’s responses to
the words are the indications upon which interpretations are built.
Thus if the response to “father” is “tyrant,” it would, taken
with other indications, be considered significant. If reaction
is very slow and difficult, or the subject can think of no
word at all, these are considered signs
istant Ssociations such as “house—my Speier on
associations such as “lamp—turkey.” Such signs, in conjunction
with the whole clinical and personality picture, are used
as diagnostic criteria. Out of experience in the use of the
test there has been built up a considerable body of interpretive
material on which the examiner can draw to assist his diagnosis.



2. Thematic Apperception Test †

The material for this test consists of three series of ten pictures
each. One series is for both men and women, one is for women,
one is for men. Most of the pictures show human beings in various
attitudes and relationships—approaching one another from a
distance, looking out of the window, etc.—or with marked
expressions. The subject looks at the pictures one by one. He is
asked to tell the examiner what the situation represented is, what
events led up to it, what outcome is probable, what the thoughts
and feelings of the characters are. The examiner makes a record,
as far as possible complete, of what is said. Readiness of
response, content of response, misrecognition of objects, and the
aspect or portion of the picture on which the subject centers his
attention are among the indicative signs. Thus, if a response to a
picture of a boy sitting with a violin before him on the table is
that he is being tyrannized over by his parents and will soon
capitulate although unhappy, this begins to suggest a certain
personality orientation. The “story” in all its aspects which the
subject produces is treated as a projective manifestation, and so
is failure to produce any “story” response at all. As with the Word
Association Test, a body of interpretive material has been
assembled and analyzed for the assistance of the examiner in
arriving at his diagnosis.

* References: Rosanoff; Rapaport, Gill, and Schaefer; Jung.
† References: Murray, 1938, 1942.



3. The Rorschach Test + 


This is the best-known and m 
jective tests. Particularl 
(q.v.) began periodical 
and investigations of th 
amounts of interpretiy 


The test itself consists of 10 large ink blots, 5 in different shades 


It is necessary to conduct t : 
subject’s full attention. eon 
ave been developed, with te 
a screen (Harrower-Erickson; Harrower-Erickson and Steiner;
Munroe, 1942). Apparently group administration is fairly
satisfactory, although there are some doubts.

The subject, of course, makes a verbal response, telling what
he sees in the ink blot. Each response is scored with reference to
5 categories.* These have to do, not with the direct content of
what the subject sees, but with the mode of his seeing. (1) The
first has to do with the “location” of the response. It may be to the
detail, or to its small detail. (2) 
with the content of the response, 
a certain type. A response may 
concepts, nature and geography, 
and abstract concepts. ( 


cb 
} among scoring plans. A brief statement r 
as this cannot take cognizance of them all. The present account is based upon
Beck, Klopfer and Kelley, and Rapaport, Gill, and Schaefer.
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d . 
determinants of the response, i.e., the elements in the stimulus
that are prepotent. The response may involve human-like,
animal-like, or minor movement. It may be to the shading values,
the color, or the form of the ink blot. (4) The fourth
category has to do with the form level of the response. A
response may range from vague and arbitrary to accurate. (5) The
fifth category has to do with the popularity or originality of the
response. It may range from very common and typical responses
often made to very rare and idiosyncratic. Detail, for instance,
may be “normal” or “unusual.” In addition to scoring in these five
categories, the total number of responses is recorded. Symbols are
used by the examiner in classifying each response on the system
described. Interpretations are based upon the categorized scores.


Some idea may be given of the interpretations based on the
scores. According to Beck (q.v.) a normal individual may be
expected to give 31 responses in all. Of these, 6 will be to whole
patterns, 21 to normal or ordinarily selected detail, and 4 to
unusual detail. Deviations in the direction of more whole-wise
responses may suggest a tendency towards broad generalization.
When extreme, they may suggest an expansive personality
neglectful of detail. Deviation towards more detail responses
indicates a personality type likely to attend to the concrete
and to approach problems practically. When extreme, it may
suggest pedantry, meticulousness, and overcaution. Feeble-minded
persons usually cannot see the blots as meaningful wholes,
and tend to interpret the parts in a stereotyped and obvious
fashion in terms of common objects. Interpretations in terms of
imaginary movement may indicate fantasy, delusion, or
inventiveness. Emphasis on color may suggest impulsiveness, egocentricity,
and emotionalism. Emphasis on form indicates intellectualism,
steadiness, and perhaps introversion. The prevailing responses of
the normal well-adjusted person are in terms of form, though
color and shading are usually mentioned. Repetition of the same
response to different cards may indicate stereotypy as in
feeble-mindedness. The ratio of usual to unusual responses is often an
index of originality. This may give the reader at least some
general notion of the type of interpretation yielded by this very
important test. As an additional instructive and interesting
illustration, Bleuler and Bleuler (q.v.) gave the Rorschach Test to
29 Moroccan peasants, and compared their responses to those of
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i d 
. The Moroccans gave few integrate 
gaun, o fantastic interpretations of Slee 
eae ol the patterns. There were few abstractive genera ea 
ma Their qualitative responses compared to those of ee 
a when made to shape only, to color only, and ae 
other respects. The authors find these test responses compatible
with the general mode of life of these people.
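
Beck's expected pattern lends itself to a simple arithmetic check.
The sketch below, in Python, is illustrative only: the category
labels, the function names, and the sample record are assumptions
of the sketch, and the comparison of proportions is not a scoring
rule taken from Beck or from any other authority. It converts the
expected counts quoted above (6 whole, 21 normal detail, 4 unusual
detail, 31 in all) into proportions and reports how far a
hypothetical record departs from them.

```python
# Expected location counts for a normal record, as quoted from Beck in the text.
BECK_EXPECTED = {"whole": 6, "normal_detail": 21, "unusual_detail": 4}


def proportions(counts):
    """Convert raw counts per location category into proportions of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("record contains no responses")
    return {category: n / total for category, n in counts.items()}


def deviation_from_expected(observed, expected=BECK_EXPECTED):
    """Observed minus expected proportion for each location category."""
    obs = proportions(observed)
    exp = proportions(expected)
    return {category: obs.get(category, 0.0) - exp[category] for category in expected}


# Hypothetical record, invented purely for illustration.
record = {"whole": 12, "normal_detail": 16, "unusual_detail": 3}
shift = deviation_from_expected(record)

for category, value in shift.items():
    print(f"{category}: {value:+.3f}")
```

A positive figure for the whole category would correspond to the
deviation "in the direction of more whole-wise responses" described
in the text. How large a deviation must be before it carries any
interpretive weight is a matter for the examiner and the published
norms, and the sketch makes no claim about it.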
Since the Rorschach Test has become widely used in America,
the question of standardizing it has arisen. Rorschach himself
was opposed to this, and Rapaport (1939) is critical of the
suggestion. Some attempts, however, have been made. Hertz,
for example, has worked out norms on 300 adolescents in junior
high school for the various Rorschach categories, and compared
these to norms obtained with other groups. Three years later
(1938) she published scoring lists for the Normal Detail category
worked out statistically on the assumption of normal distribution,
instead of being accumulated


other lists. But such work has b 
in principle by many experts.

Reliability, too, is by no means well determined. J. E. Bell
(pp. 132-35) has summarized a considerable number of
representative reliability studies, using the test-retest, split-half, and
matching methods, but the general outcomes do not seem decisive.
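
Of the methods Bell lists, the split-half procedure is the easiest
to state numerically. A minimal sketch in Python follows. It
assumes that item-level scores are available for each subject,
which projective protocols do not ordinarily supply in so tidy a
form, and the function names and sample data are invented for
illustration. The scores on the two halves are correlated, and the
result is stepped up by the Spearman-Brown formula,
r_full = 2r / (1 + r).

```python
import math


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r_half):
    """Step a half-test correlation up to an estimate for the full test."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """item_scores: one list of item scores per subject (same length each)."""
    first_half = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    second_half = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson(first_half, second_half))


# Hypothetical data: five subjects, six items scored 1 or 0.
scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]
reliability = split_half_reliability(scores)
print(f"split-half reliability (stepped up): {reliability:.2f}")
```

The odd-even split is used here only because it is the commonest
convention. With free responses of the projective kind, the choice
of what counts as an "item" is itself open to question, which is
part of the reason the published outcomes are hard to compare.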


jes 0 
, the question of validity is primary, Formal studies 
validity have compared Rors 


ardization and reliability involve 


ro- 
debate that goes to the heart of projective testing. Should P 
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tee be standardized? Should definite norms be worked 
out which would more or less “automatically” interpret the varied
and multiple responses elicited? It is very doubtful.
Interpretation depends upon the total picture, which must be assessed by
the experience of the examiner. The standardization
of a technique would seem legitimate only where the test
responses themselves are channelized. So also with reliability; this
may not have the same value in projective as in psychometric
testing. As we have seen, it pertains not to the test alone, but also
to the examiner, the subject, and the total setting. It depends
largely on channelization, and this the projective tests avoid as
far as may be. What in a psychometric test would be variable
errors here may be important indicators which ought to be
noted. The more freely, within limits, the subject responds, the
better the chance of the examiner to reach a significant
understanding. This may even mean that the reorganization of the
Rorschach Test for group use is a mistaken development.
Projective testing should be free to develop its own techniques and
procedures, and if it becomes assimilated to psychometric testing,
there is danger that its distinctive values will be destroyed and
its unique contribution lost. There is, of course, a complementary
danger, for if projective testing proceeds without some
sort of adequate controls, which is entirely possible,
worthless and trashy instruments will be produced, and
fantastic interpretations will be broadcast.
Apparently projective testing in the immediate future
must steer a course between charlatanry and an alien
pseudo-scientific rigidity.
alist, Projective tests are exp ed upon a aes 
Simpl, Psychology. Perception, it 1s p d out, leno 5 
Ratho, by the impact of external objects upon the AT T 
off td it is an interpretive and purposive reorganiza ion ci a 
mec, Such impacts. In the same yay ee actions, 
anical establishment or grinding 10 Gh nena coni E 
Rather it is again a process of purposive, meaningful organization.
What a person perceives and how he perceives it, what experiences
he associates together and how he associates them, depend upon
cany petal oF personality organization, a ar DETU z a 
chological i ions < $ 
Weie not a Aa = this viewpoint, although they im- 


* See Wolfgang Koehler, Principles of Gestalt Psychology.


licitly bas 
is pointe 
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` y to think of the test as a 
means of discovering active associations which were related to
and indicative of emotionally potent experiences. But today
students of projective testing think of any response as resulting
from and thus manifesting not an isolated experience or complex,
but the total personality organization.

The tests described are simply devices for eliciting projective
and indicative responses which are practically manageable and
scorable. The difference between the Rorschach Test and the
Thematic Apperception Test lies in the degree of explicit
structuralization in the stimulus. The latter controls and predetermines
response more than the former, and is more apt to elicit fully
conscious and relatively superficial constructions. The same
limitation also applies to the Word Association Test. A limitation
which applies to all three is the very artificial character of the
stimulus situations. They are not substitutes for the study of the
subject’s behavior and of its projective manifestations in wider
and more normal settings. A man’s reaction to ink blots does to
some extent show what sort of person he is, and it has the
advantage of being convenient and scorable. But it is of course a small
sample, and may be a distorted sample of his much more revealing
reaction to the concerns of daily life.

So far as psychological assumptions and orientation go, the
chief difference between projective and psychometric tests is that
in the former these are explicit and that in the latter they are
not. A psychometric test is committed to measurement. A projective
test is committed to diagnosis. To some extent the two can be
combined, for there are projective elements in many psychometric
tests, and psychometric elements in projective tests. But an


e both values in one instrument, as f° 
instance by standardizin sative 
only in both being lost. The scorn which students of projective
techniques are so free to pour upon psychometrics is particularly
focused upon the ordina


self-ratings on specified 


questions cannot possibly be other than
stereotyped and superficial. Yet when a personality scale is soundly
built on well-chosen conc


Minnesota Multiphasic Perso

psychometric tests are by no manner of means committed
to a psychology. They are merely committed to good
and workable analyzing concepts. In terms of these concepts they
yield measurements which are reasonably definite, which is no
small advantage and which obviously can be and should be
interpreted in terms of the total personality and setting of the subject.
Projective testing is no Copernican revolution in mental
measurement, as some would appear to suppose. But it is a valuable new
development whose future authenticity will depend upon
avoiding the twin dangers noted above. Measurement and diagnosis
are the two aims of all mental testing. They are by no means
mutually independent, and future progress will turn upon the
development of better instruments for their achievement.
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TR N. Freeman, Mental tests: their history, principles, 
and applications (Rev. ed.; Boston: Houghton Mifflin Company,
1939); Chapter 11, “Technique and theory of mental tests. II,
Problems pertaining to scores and norms” (a treatment of various
scores and measures); Chapter 16, “The nature of mental ability”
(a discussion of factor analysis).

Philip E. Vernon, The measurement of abilities (London: University
of London Press, Ltd., 1940), Chapter 8, “Analysis of abilities.”
An exceptionally clear treatment of factor theories.

Gertrude Rand, “A discussion of the quotient method of specifying
test results,” Journal of educational psychology, 16 (1925), 599-618.
An analysis of the I.Q. and the C.I.

Bruno Klopfer and Douglas McGlashan Kelley, The Rorschach
technique (Yonkers: World Book Co., 1942), Chapter 1, “History of
the Rorschach method”; Chapter 2, “Methodological problems.”
A systematic and historical account of the general aspects of
projective testing.

Helen Sargeant, “Projective methods: Their origin, theory, and
application in personality research,” Psychological bulletin, 42
(1945), 257-93. An over-all summary and documentation.



QUESTIONS FOR DISCUSSION

1. Would it be true to say that unless the I.Q. were a fairly stable
measure, the problem of the constancy of obtained I.Q.’s could not
even be approached?

2. What is the advantage of using purely statistical norms and
measures? Are such norms and measures without psychological
reference?

3. In what respect are the problems in applied psychology in
making a special purpose test different from those in making a general
intelligence test?

4. What are the advantages and disadvantages of standardization on
functional groups rather than on unselected populations?

5. If you agree with Terman that the I.Q. involves no commitment
regarding the form of the growth curve, does this mean that the rate
of mental growth does not affect an actual obtained I.Q.?

6. Can you think of any other vocations or activities, besides
truck-driving aptitude, which might be difficult or impossible to determine
by tests? Why does such difficulty arise?

7. Have you ever heard or read popular psychological discussions
which seem to identify the sort of entities or concepts discovered in
factor analysis?

8. Should the apparent reasonableness on general grounds of any
given factor theory affect its acceptability?

9. Do we ever place any reliance on projective manifestations in
judging people in everyday life?

10. Does the admittedly analytic character of psychometric tests
really imply a far-reaching psychological orientation?

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Atomistic Psychology. See Mechan- 
istic Psychology 

Attitude, 257-58, 286; intensity of,
285; measures of, 283 ff.

Attitude Scales, generalized, 285-86;
specific, 283-85

Aviation cadet qualifying examina- 
tion, 405 

~ . 

Bernreuter Personality Inventory, 
263-66, 267, 399-400, 413 

Binet scale, ro, 38, 39, 40, 43, 6% 
84, 93, 126, 140, 412 

Binet tests 74-75 
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Binet's practice, 97 ff., 118 f., 129 f 
137 f. 

Bogardus Fatigue Test, 228 

Briggs Analogies Test, 245 

Brightness, 106-7 

British Mental Deficiency Act, 380 

Brown Spool-Packer Test, 228 


C score, 178-80
California First Year Mental Scale,
96, 185, 190-91
California Preschool Mental Scale,
184-85, 192
California Test of Personality,
261-62, 407
California Tests of Mental Maturity,
36, 178, 207, 213-15, 402,
413
Canal-boat children, 301, 304
— questions, 48 
Census groups, 33
ae PS, 333, 403
Character, measures of, 289 ff.
Character Education Inquiry, 289
Chicago Non-Verbal Examination,
Chicago Tests of Primary Mental
Abilities, 208-12, 400, 412-13,
414
Choice, 14-15
Chronological age, 106-7, 195, 700
Church, attitude toward, 283-84
Clapp-Young Self-Marking Device,
154
aires value, 172
Coefficient of intelligence, 394-95
Commitment, institutional, 43, 135,
136, 198
Community setting, 298-301
eens tests, 6
Complex learnings, 14
Complexity, 81
Sea a 8i 
oncept, working, 24-25) 26, 34 
35-37, 41, 61, 69, 72-77 9% 


9, 254- 
T 


3 
87, 291- 


Ds 
p 
B on 


92, 375-76, 405-6 
Concepts, formulation of, 14; new 
working, 404 ff. 
Conduct, 290
Configurational psychology, 23,
25-26, 74-75, 80, 93, 410
Constancy of mental traits, 330 f.,
374. See also I.Q., constancy of
Cooperative tendencies, 289
Cornell-Coxe Performance Ability
Scale, 171-72
Correlation matrix, 409-10
Creative output, 359
Criteria, 39-45, 198
Cultural factors, 330
Cumulative effects, 303-4, 311, 312,
316-17, 322
Curriculum, vital, 324
Curtis Test of Arithmetic
Achievement, 332


Delinquency, 384

Detroit Clerical Aptitudes
Examination, 238

Detroit General Aptitudes
Examination, 237, 255, 399-400

Detroit Mechanical Aptitudes
Examination, 238

Detroit Scale for Diagnosis of
Behavior Problems, 269-70

Detroit Tests of Learning Aptitude,
136-37, 140, 400

Developmental Examination,
188-90, 194

Developmental Schedules, 188-89,
351

Developmental score, 192-93

Diagnosis, 43, 115, 128-29, 203,
268-69, 420

Difficulty, 80-81, 82, 83, 137,
216-17

Distribution, of mental traits,
364 ff.; of scores, 220-21
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Drake Test of Musical Memory, 61, 
250 

Draw-a-Man Test, 59-60, 174-75,
300, 355


Economy, 81 

Educational achievement, 116-18 
221, 222, 239, 241-42, 242-43, 
315 

Educational promise, 382 

Educational tests, 4-7. See also 
Achievement tests 

Efficiency, 215-16 

Electrical engineers, 405 

Emotional blockages, 81 

Emotional disturbances, 99, 112, 172

Environment, 49, 150, 174-75, 195, 
304, 311, 312-13, 314, 328, 333, 
340, 347, 350, 353; analysis of, 
376-77 

Equality of units, 138 

Equivalence, rational, 52-54 

Error, constant, 29-30, 45; of
interpretation, 32-33; personal,
31-32; variable, 30-31, 46


Estimating intelligence, 90-92 
Experiment, Psychological, 2-4 


Face validity, 44 

Factor analysis, 3-4, 22, 26, 36, 39, 
77, 93, 115, 137, 194, 199, 209- 
12, 213, 217, 229, 235, 258, 262, 
265, 275, 400, 407-15 

Factorial purity, 26, 115-16,
117-18, 217

Factorial validity. See Validity,
factorial

Factors, 3-4, 26, 36, 72, 87, 147,
166, 212, 213-15, 406-7, 413-15;
group, 411; second order, 411-12;
special, 411

Faculties, 73, 75, 225, 414

Faculty psychology, 26, 93

Familiarity, 48-49

Family, 305 ff., 369, 373


Feeble-mindedness, 378, 379, 380 
Financial rewards. See Income 
Finger Dexterity Test, 22 

First grade entry, 106, 391 

Form board, 168 

Formal discipline, 74 

Foster home, 306-12, 377, 404-7 


General factor, 4, 80, 115, 116, 213, 
410-11 

General intelligence, 35, Ch. 1, 
133-34, 175, 197-98, 200-1, 210, 
222, 230-31, 232, 233, 235, 239;
definitions of, 77 ff.; descriptions
of, 80 ff.

Genius, 297, 320-21, 378, 379, 38% 
384-85 

Gifted children, 347-48, 380-2 384 

Global score, 90, 93, 128, 159 al 
8, 210-11, 213, 217, 241, 35% 
390-99 

Grade norms, 154-55 

Group norms, 402 

Group tests, 10 

Growth. See Linguistic develop- 
ment; Mental growth; Motor 
development 


Haggerty Intelligence Examination 
Delta, 151, 220, 207 : 
Haggerty-Olsen-Wickman Behavior 

Rating Schedules, 270 
Hand Tool Dexterity Test, 234 
Harlem, 332 
Height, 366 
Henmon-Nelson Test of Mental 
Ability, 34, 38, 152-55 
Heredity, 49, 79, 81, 246, 314, 328,
332, 340, 371, 373 ff.
Herring-Binet Scale, 118-20, 161
Higher mental processes, 16, 23
“Hollow” folk, 300-1, 303-4
Home Inventory Scale, 310-11
Humm-Wadsworth Temperament
Scale, 49-50, 266-67
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Ideational Learning Test, 86 


Idiot, 380 
w ane Intelligence Examination, 
21 


Imbecile, 380 
Income, 305, 376-77 
Index of brightness, 395 
Indians, American, 59-60, 174-75» 
328, 330, 332 
Individual differences, 382-83 
Individual tests, 9-10, 99 
Infant speech, 352
Infant tests. See Young children, 
tests for 
Inhibition, 291 
I.E.R. Arithmetic Test, 325
I.E.R. Assembly Test for Girls,
233
I.E.R. Intelligence Scale CAVD,
35, 38, 47, 85, 90, 137-40, 20%
217, 347, 381
Institutional setting, 324 
Intellect CAVD, 139 
Intelligence, estimating, 90-92
Intelligence quotient, 106-7,
108-11, 126, 127, 132, 133, 156-57,
158, 165, 167, 184, 200, 216;
changes in, 345, 347; classification
of, 378-79; constancy of, 105-6,
191-92, 317, 339-40, 341-47,
394; early, 193, 346-47; gains,
318-19, 323; high, in eminent
persons, 384-85; meaningfulness
of, 393-94; stability of, 108-9,
133, 345, 392-93
Intelligence tests, 41, 42, Ch. V, Ch.
VI; appraisal of, 215 ff., 220-22;
for high school and college,
161 ff.; special purpose, 239, 240,
244, 372; trends in, 215-17;
vocational uses of, 208
Interest, 272, 348-49; and success,
273; measures of, 41, 272 ff.;
permanence of, 274
Interest groups, 274-75


Interest Questionnaire for High 


School Students, 277 

Interform coefficient, 52 

International Intelligence Test, 312 

Interrelations of tests, 40-41, 60-61,
69, 134-35, 144, 145, 154,
155, 158, 159, 160, 161, 169-70,
171, 171-72, 174, 183-84, 194,
218-20, 230-31, 303, 391

Interval Discrimination Test, 249- 
50 

Iowa Tests for Young Children, 
185-87 

Iowa University Placement Examination,
6, 206, 240-41, 244, 372, 493


Kaffirs, 69 
Koerth Pursuit Test, 228 
Kohs Block Design Test, 170 
Kuder Preference Record, 280-83
Kuhlmann-Anderson Test, 161, 219, 
221 
Kuhlmann-Binet Scale, 121-22, 221,
297, 310, 345, 346
Language, 329-30
Language tests, 89 
Latin prognosis, 245-46 
Latin Prognosis Test, 226, 245, 254 
Law Aptitude Examination, 240 
Law Aptitude Test, 240 
Learning, 85-7, 92, 137, 349, 353
Length of test, 46-47 
Level, 35. See also Altitude
“Likert” technique, 284 
Linguistic development, 351, 35% 


355
Logical Decision Test, 270-71 


McAdory Art Test, 251 
McCall Multi-Mental Scale, 220 


Mannikin Test, 168 
Mare and Foal Test, 168 
Marks, 42, 221, 241-42
Mathematics Aptitude Test, 241
Maze Test, 170, 175, 357 
Measurement, conditions of, 29 ff.;
logic of, 33-34, 68-70, 259-60, 
335, 364-72, 374-76, 401-3 
Measures of Musical Talent, 41, 
55, 61, 226, 247-49, 399 
Mechanical aptitude, 226, 229; 
tests of, 41, 220 ff. 
Mechanistic psychology, 22-23, 25- 
26, 84 
Medical Aptitude Test, 19-20, 42, 
234-40, 244, 372 
Meier Art Test, 252-53 
Meier-Seashore Art Judgment Test, 
252 
Memory, 258-59; musical, 250 
Mental age, 61, 67-68, 103-6, 108-11,
131-32, 149-50, 156-57, 165,
167, 171, 200, 327, 352, 383,
390-91; of population, 149-50, 
197 
Mental decline, 358 
Mental growth, 104-5, 107, 108, 
123-25, 127, 128, 188, 194-95, 
350 ff., 394; adult, 354-59; curve
of, 123, 125-26, 161, 359-63, 
391, 395; early, 351-53 
Mental maturity, 100 
Mental processes, 8-9 
Mental units, 125, 161
Mentality, 294-98; waste of, 328 
Mercator Projection, 32-33, 364, 
368, 390, 391 
Merrill-Palmer Scale, 180-84, 194,
393, 394
Methods of work, 15
Metropolitan Readiness Tests, 244
Migration, 301-2, 332-33
Miles Drill Test, 228
Miller Mental Ability Test, 219
Miner’s Analysis of Work Interest,
280
Minnesota Home Index, 377


Minnesota Interest Analysis Test,
232

Minnesota Mechanical Assembly
Test, 232

Minnesota Multiphasic Personality
Inventory, 267-69, 271, 283, 399,
422

Minnesota Paper Form Board, 232

Minnesota Preschool Scale, 178-80

Minnesota Rate of Manipulation
Test, 22

Minnesota Spatial Relations Test,
232, 233

Minnesota Vocational Test for
Clerical Workers, 226, 238-39

Moral conduct, measures of, 289"

Moral knowledge, 289-90

Moral opinion, 291

Moron, 380

Motivation, 5 f

Motor ability, 228-29; measures of,
226 ff.

Motor Achievement Test, 187-88

Motor age, 188

Motor development, 188, 351

Musical Memory Test, 250

Musical persons, 259


National Intelligence Tests, 4, 151-52,
153, 218, 220, 229, 300, 361
Negativism, 194 
Negroes, 328-29, 330, 331, 332, 380 
Non-language tests, 9 
Non-profit organizations, 166-67 
Non-racial factors, 330-32 
Non-verbal tests, 9 
Normal distribution, 65, 91-92, 
364 ff., 398 
Normality, assumption of, 367-72
Normalization, 368-69 
Northumberland, 297-98, 301-2 


Objectivity, 32, 50 ff.
Observation, 35-36, 366

Occupational Orientation Inquiry, 
279-80 

Occupations, 198-99, 283, 295-98 

Ohio State University Psychological 
Test, 165-66, 221 

Organization of mind, 195 

Originality, 15-16, 81 

O'Rourke Mechanical Aptitude
Test, 65, 234-35

Orphanage, 323-24

Otis Group Intelligence Test, 155,
219, 220, 221, 222

Otis Quick-Scoring Mental Ability
Tests, 158-59, 202, 388

Otis Self Administering Test of
Mental Ability, 42, 65-66, 114,
144, 155-58, 203, 217, 219, 220,
221, 311, 312, 325, 332, 354


kar placement, 186-87, 200 
Percent of average, 126, 395-96;
stability of, 127, 128 
Percentile norms, 154 
Percentile scores, 65-66 
Percentiles, 396 
Performance tests, 9, 89, 167 ff.,
194, 300, 301, 302-3, 330, 355,
391; values of, 171-72, 175
Persistence, 291 
Personal constant. See Percent of 
average
Personality, 258, 260, 262-63, 407; 
total, 384-85; types, 87-99, 258-59,
266, 267
Personality quotient, 253
Personality Quotient Test, 262-63
Personality tests, 43, 260 ff., 406;
evaluation of, 271-72
Phrenology, 20, 73 
Pintner-Cunningham Primary Test, 
159, 219, 244, 300
Pintner General Ability Tests, 244;
non-language series, 172-74, 329;
verbal series, 159 
Pintner Intelligence Test, 159

Pintner Non-Language Group Test, 
330, 393, 394 

Pintner-Paterson Scale of Perform- 
ance Tests, 168-70, 219 

Pintner-Paterson Short Scale, 332 

Pintner Rapid Survey Test, 299 

Point Scale for the Measurement of 
Intelligence, 118 

Point Scale of Performance Tests, 
170-71 

Point scales, 89, 114, 118 ff.; values 
of, 120, 129 

Power, 147-49 

Practical problems, 15 

Practice effect, 322, 346 

Practice material, 49-50, 152, 165

Prediction, 39° 

Preschool, 315-24, 353 

Pressey Interest-Attitude Tests, 
288-89 

Pressey Primary Scale, 303 

Primary mental abilities, 281, 493, 
411-12. See also Factor analysis, 
Factors 

Probability, 366-67 

Probable error, 132 

Professional and academic aptitude 
tests, 239 ff. 

Profile scores, 93, 159, 165, 167,
207-8, 24%, 213-14, 277 248,
255, 261-62, 399-400


Profiles, 44, 73, 383


Progressive education, 23-24 
Projection, 416 
Projective tests, 7-8, 13, 23, 29,
257, 415 f.

Psychoanalysis, 25
Psychological theory, 22-26
Psychometric tests, 7-8
Psychotics, 88, 89, 116


Quantity and quality, as aspect of
psychological test scores, 383-85
Race, eae is
Racial purity,

ie Operators, Test for, 
404-5 

Range of difficulty, 47 

Range of intellect, 35, 82-83, 84, 
99, 137, 139 

Rapport, 50, 331 

Rational Learning Test, 85 

Raw scores, 62, 63 

Reality of traits, 36-37, 75-76 

Reasoning, 359 

Recall, immediate, 3 

Reconstruction, 81-82 


Reliability, 31, 45 ff.; coefficient of,
51, 55-56. See also Interform
coefficient, Retest coefficient,
Split-half coefficient; degree of,
57-59; recording of, 51-56.

Retardation, 352, 375 

Retest Coefficient, 52 

Retroactive inhibition, 25 

Revised Stanford-Binet Scale,
administration of, 106; criticisms
of, 108-18; reliability of, 109-11;
scaling of, 103-6; scoring of,
106-7; standardization of, 103-6.
See also Stanford-Binet revisions

Rogers Interpolation Test, 245 

Rorschach Test, 418-21 

Sample, 64-65 

Scalability, 285 

Scales, 96 

Scaling, 10, 47

Scatter Pattern, 89 

School achievement. See Educational
achievement

School continuance, 315 

School environment, 197-98 

Schooling, 150, 331, 373; effects of 
later, 325-27; and mentality, 
315 ff. 

Science Research Associates (Test) 
of Primary Mental Abilities, 212- 
13 
Scores,
78 ff.

Scores, social meaning of, 380-
8 Pi

Scores, stability and meaningfulness
of, 389 ff.

Scores, types of. See Global scores,
Percentile scores, Profile scores,
Raw scores, Standard scores

Scoring devices, 154, 158, 164, 215-16

Scrambled organization, 144, 154,
156, 215

Screening tests, 206, 216, 405- 

Singing, 352

Seashore Measures of Musical
Talent. See Measures of Musical
Talent

Seashore Motor Rhythm Test, 228,

Seguin Form Board, 168

Selection, 301-2, 315 

Set, mental, 49 

Sex differences, 386 

Single-item tests, 216 

Skin color, 329 

Social Significance, 81, 82 

Socio-economic factors, 294 ff., 330-31

Socio-economic scales, 376 

Socio-economic status, 373 

Spearman-Brown Prophecy For- 
mula, 46-47, 53 

Special abilities, 235-36

Speed, 35, 81, 83, 84, 137, 147-49,
169, 201, 211, 331-32, 357 

Spiral-omnibus. See Scrambled
organization

Split-half Coefficient, 53

Stability, 389-90. See also Intelligence
quotient, stability of; Percent
of average, stability of

Standard deviation, 66 


Standard deviation scores, 396 
Standard error, 54 


Significance of, 363-64, 
Standard scores, 66-67, 132
Standardization, 33, 61 ff., 197, 216-17,
219-20, 312, 335, 364-65,
Standardization group, 64-65, 67
Stanford Achievement Test, 221,
312
Stanford-Binet 
106 
Stanford-Binet revisions, 17-18, 59,
7, 85-86, 88, 90, 97-118, 122,
123, 125, 126, 135-36, 145, 149-50,
161, 169, 171, 174, 178, 183,
191-92, 194, 200, 202, 218, 219,
220, 221, 296, 300, 302, 310, 312,
Se: ae Ree ies Se F
346, 347, 353, 355, 362-63, 371,
a 394, 402, 413. See also Revised
Stanford-Binet Scale, Stanford
Revision of Binet Scale
Stanford Later Maturity Study,
see 
Stanford Motor Skills Test, 227- 
S = 
Stanford Revision of Binet Scale,
standardization groups, 100-2;
Be ata of, 116; validation
of, 102-3. See also Stanford-Binet
revisions
Stanford Scientific Aptitude Test,
253-54
rian i measures, 396-98 
aa Assembly Test, 230-31) 
S : 
iy Measures of Mechanical 
Aptitude, 42, 234
Study of Values, 286-87
ee 48-50 
oe 2; intercorrelation of, 147 
Success, vocational, 43, 44
Summated ratings, 284
Talent, 225-26, 246-47, 253
Talent tests, 41, 246 ff.
Teacher’s ratings, 42
Temperament, 257, 258
Terman Group Test of Mental
Ability, 42, 148, 161-62, 163,
219, 220, 221, 388

Terman-McNemar Test of Mental 
Ability, 162-63 

Test instructions, 50 

Test items, 2, 9, 10-14, 34, 37-39, 
48, 201, 217; independence of, 
47-48 

Test of Mechanical Comprehension, 
236 

Test of Public Opinion, 288 

Tests, early, 73; emerging types of, 
208 ff.; improvement of, Gh AI 
363-64; limitations of, 10 ff., 14- 
16; nature of, 1-2; types of, 7 ff.; 
values of, 10 ff., 16-22, 24, 45; 
150, 217-18 

Tests of Fundamental Abilities of 
Visual Arts, 250-51 

Tests of Mental Development, 10, 
122-28 

Thematic Apperception Test, 417

Thorndike Intelligence Examination
for High School Graduates, 19, 
163, 202, 228, 235 

Thorndike-McCall Reading Scale,
as
Thorndike Test of Word Knowledge,
245
“Thurstone” technique, 283, 285
Time between testings, 345-46
Trabue Language Completion Scale,
145
Trait, 225-26, 257 265
Truck drivers, 405
True-false tests, 60
Trustworthiness, 35, 38
Tweezer Dexterity Test, 227


Twins, 312-14, 377


U. S. Armed Forces Institute Tests
of General Educational Development,
5-6, 205-6
Units of measurement, 388-89

Universal test, 69, 402 

University of Iowa, 319 

University of Minnesota, 242-43 

Unreliability, causes of, 46-51 

Urban-rural differences, 298, 301, 
302-3 


Validity, 30, 34 ff., 44-45, 166, 217, 
228-29, 248-49, 254, 388-89, 
407; establishment of, 39-45; 


factorial, 36, 39; practical, 39 
Value, 258 


Variability, 123 

Variable error, 127 

Variance, 53 

Verbal tests, 302-3, 330, 391; cf. 
performance tests, 219-20 

Vocabulary test, 38, 59, 111-13, 


193, 355-57 
Vocational aptitudes, tests for,
236 ff.
Vocational Interest Blank for Men,
43, 277-79, 280, 281, 413.
Women, 279


Waste of mentality, 328 

Wechsler-Bellevue Scale, it 935 
107, 129-36, 137, 140, 178, 202- 
n ae 400; form B, 
131, 135-36

Weight, 366 

Whole-part learning, 86 

Winnetka, Ill., 319, 322

Wonderlic Personnel Test, 203, 206, 
216 

Word Association Test, 417 

Work sample tests, 226, 245 465 

World War I, 142-51, 197, 295 
World War II, 20, 43-45, 205-6,
216, 295, 404-7 


Young children, tests for, 178 ff., 
190-96, 321 


Zero intelligence, 139, 362 